Short abstract
The Modified Kalman Filter approach for pooling information across time and outcomes is shown to improve accuracy in national estimates of health outcomes, including cancer, diabetes, and hypertension, especially in small racial/ethnic subgroups.
Abstract
The Modified Kalman Filter approach for pooling information across time and across outcomes is shown to improve accuracy in national estimates of health outcomes, including cancer, diabetes, and hypertension, especially in small racial/ethnic subgroups. The developed SAS macro models true health states in each subgroup assuming a linear time evolution and an autoregressive deviation around such trend. The macro provides multiple options for users.
There is much interest in using annual health outcome data to study racial/ethnic health disparities for both common racial/ethnic groups, such as non-Hispanic whites and blacks, and for rarer groups, such as American Indians/Alaska Natives (AI/AN) and Asian and Hispanic subgroups. Examples of such health outcomes include cancer, diabetes, hypertension, coronary heart disease (CHD), and average body mass index (BMI), which are frequently estimated from repeated cross-sectional samples of the U.S. population through annual health surveys, such as the National Health Interview Survey (NHIS). However, even large surveys like the NHIS typically provide only small annual samples of rarer subgroups, and the annual sample means can be very imprecise for these groups. For example, Table 1 provides 2004 NHIS estimates of the prevalence of stroke and diabetes for 11 racial/ethnic groups in the United States. The sampling errors are large for all but the most populous groups. For stroke, the relative standard error (SE divided by the prevalence or mean) exceeds 0.20 for eight of the eleven groups and exceeds 0.30 for six of the groups. Typically, estimates with relative standard errors exceeding 0.30 are considered unstable (Klein et al., 2002) and are commonly suppressed from inference and decision making because of their lack of reliability.
Table 1.
Stroke and Diabetes Prevalence, Standard Errors and Relative Standard Errors from the 2004 NHIS Data and Correlation Between Stroke and Diabetes over Time (1997–2004)
Race/ethnicity | Stroke % | Stroke SE | Stroke Rel. SE | Diabetes % | Diabetes SE | Diabetes Rel. SE | Correlation (1997–2004) |
---|---|---|---|---|---|---|---|
White | 3.20 | 0.13 | 0.04 | 6.84 | 0.18 | 0.03 | 0.16 |
Black | 3.01 | 0.28 | 0.09 | 10.34 | 0.49 | 0.05 | 0.40 |
AI/AN | 3.77 | 1.44 | 0.38 | 16.04 | 2.76 | 0.17 | 0.49 |
Chinese | 2.10 | 1.06 | 0.51 | 6.34 | 1.91 | 0.30 | 0.03 |
Filipino | 3.09 | 1.40 | 0.45 | 8.28 | 1.98 | 0.24 | −0.32 |
Asian Indian | 0.70 | 0.69 | 1.00 | 10.47 | 2.43 | 0.23 | −0.17 |
Puerto Rican | 2.45 | 0.70 | 0.28 | 10.38 | 1.45 | 0.14 | −0.05 |
Mexican | 1.23 | 0.27 | 0.22 | 6.20 | 0.55 | 0.09 | 0.11 |
Cuban | 3.09 | 1.54 | 0.50 | 12.18 | 2.24 | 0.18 | −0.40 |
Other Hispanic | 2.22 | 0.29 | 0.13 | 7.59 | 0.54 | 0.07 | −0.04 |
All other | 1.93 | 0.64 | 0.33 | 5.19 | 1.09 | 0.21 | 0.12 |
NOTE: Relative standard error is the standard error divided by the prevalence or mean. The correlation is between the detrended diabetes and stroke prevalence.
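The relative standard error rule can be checked directly against the stroke columns of Table 1. The following is a minimal sketch: the function names and the `STROKE` dictionary are introduced here for illustration only, and because the inputs are the rounded table values, a recomputed ratio can differ slightly from the published Rel. SE column.

```python
# Stroke prevalence (%) and standard errors copied from Table 1 (2004 NHIS).
STROKE = {
    "White": (3.20, 0.13),
    "Black": (3.01, 0.28),
    "AI/AN": (3.77, 1.44),
    "Chinese": (2.10, 1.06),
    "Filipino": (3.09, 1.40),
    "Asian Indian": (0.70, 0.69),
    "Puerto Rican": (2.45, 0.70),
    "Mexican": (1.23, 0.27),
    "Cuban": (3.09, 1.54),
    "Other Hispanic": (2.22, 0.29),
    "All other": (1.93, 0.64),
}


def relative_se(estimate, se):
    """Relative standard error: the SE divided by the prevalence or mean."""
    return se / estimate


def unstable_groups(table, threshold=0.30):
    """Return the groups whose relative standard error exceeds the threshold."""
    return [
        group
        for group, (estimate, se) in table.items()
        if relative_se(estimate, se) > threshold
    ]


if __name__ == "__main__":
    print("Rel. SE > 0.20:", unstable_groups(STROKE, threshold=0.20))
    print("Rel. SE > 0.30:", unstable_groups(STROKE, threshold=0.30))
```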
Because these standard errors are too large to meet the National Center for Health Statistics recommended standards for estimating health disparities (Klein et al., 2002), different methods have been proposed for better estimation of the prevalence or means of interest (Lockwood et al., 2011). With population health outcomes generally evolving slowly over time, pooling data across years within groups provides an attractive means of improving the precision of the latest (current-year) annual estimates of disease prevalence and other health outcomes without increasing sample size. Co-morbid conditions can also be informative for disparity research on specific health outcomes. In a study of the disparity in diabetes between blacks and whites in the United States, Miller et al. (2004) reported that interventions addressing diabetes disparities should focus on managing co-morbidities, such as hypertension, shown to be related to the disparity. In the same manner, because of the clinical correlation between some health outcomes (e.g., diabetes and stroke), pooling data across outcomes and years simultaneously within groups can also help increase the precision of estimates. In the NHIS data in Table 1, the relative standard errors for stroke in several of the rarer racial/ethnic groups are above 0.30, making those estimates unstable, whereas the diabetes estimates for the same groups have relative standard errors at or below 0.30. However, there is a significant correlation (time detrended) between stroke and diabetes in most racial/ethnic groups in the United States.
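One way to read "time detrended" correlation is sketched below. It assumes that each series is detrended by an ordinary least squares line and that the Pearson correlation is then taken between the residuals; the source does not state how the detrending was done, so this is an illustrative assumption. The yearly values are made up and are not NHIS estimates.

```python
import math


def detrend(times, values):
    """Residuals from an ordinary least squares line fitted to values over times."""
    n = len(values)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values)) / sxx
    return [y - (y_bar + slope * (t - t_bar)) for t, y in zip(times, values)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def detrended_correlation(times, x, y):
    """Correlation between the deviations of x and y around their linear trends."""
    return pearson(detrend(times, x), detrend(times, y))


# Hypothetical yearly prevalences (%) for one group, 1997-2004 (not NHIS values).
years = list(range(1997, 2005))
stroke = [2.1, 2.4, 2.2, 2.6, 2.5, 2.9, 2.7, 3.0]
diabetes = [6.0, 6.6, 6.3, 7.1, 6.9, 7.6, 7.2, 7.9]
r = detrended_correlation(years, stroke, diabetes)
```

Removing the trend first matters because two outcomes that both rise over time would otherwise appear correlated even if their year-to-year deviations were unrelated.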
To improve precision, Elliott et al. (2009) developed a model called the Modified Kalman Filter (MKF), an extension of the Kalman filter estimation technique (Kalman, 1960) that assumes true health states in each racial/ethnic group evolve according to a group-specific linear trend and autoregressive (AR) deviations around that trend. They showed that the MKF is capable of improving the accuracy of health state estimates from such data as the NHIS. Lockwood et al. (2011) further extended the method to allow “borrowing information across groups” and Setodji et al. (2011) included information across correlated outcomes.
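The model described above can be written compactly as follows. This is a sketch rather than the authors' exact specification: the first-order autoregression, the normal distributions, and the symbols $\eta_{gt}$, $\alpha_g$, $\beta_g$, $u_{gt}$, $\rho$, and $\tau^2$ are assumptions introduced here for illustration. $Y_{gt}$ denotes the sample mean for group $g$ at time $t$ and $\mathrm{SE}_{Y_{gt}}$ its standard error.

```latex
\begin{align*}
Y_{gt}    &= \eta_{gt} + e_{gt},
          & e_{gt} &\sim N\!\left(0,\ \mathrm{SE}_{Y_{gt}}^{2}\right)
          && \text{(sample mean = true state + sampling error)} \\
\eta_{gt} &= \alpha_g + \beta_g\, t + u_{gt}
          &&
          && \text{(group-specific linear trend + deviation)} \\
u_{gt}    &= \rho\, u_{g,t-1} + \varepsilon_{gt},
          & \varepsilon_{gt} &\sim N\!\left(0,\ \tau^{2}\right)
          && \text{(autoregressive deviation around the trend)}
\end{align*}
```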
The MKF Procedure and MKF SAS© macro are designed to provide estimates of group means or prevalence rates from these different methods using data consisting of sample means and their standard errors from multiple time points within each of one or more groups. The MKF procedure pools data across time points within a group to improve the accuracy of the estimated mean for the final time point relative to the final period sample mean. When two outcomes are considered, where a correlation between those two outcomes can reasonably be assumed, this procedure also allows borrowing information from one outcome in the estimation of the other outcome. The sample means can be from simple random samples or complex survey designs.
The MKF macro models the sample means for any group as the unknown population mean plus an additive error term with variance given by user-supplied standard errors. The population mean is a function of a linear trend in time that describes the general progression of the outcome for the group and time period deviations from this trend. The goal of the software is to provide an accurate estimate of a population's means (unknown trends plus unobserved deviations from the trend) given the model and the observed time period means. The macro
estimates the model parameters;
uses the estimated parameters to produce an optimal weighted average of linear trend and current and past years' estimates.
Using this approach, the MKF procedure can yield substantial gains in accuracy of estimates for small groups relative to a single time period sample mean (Elliott et al. 2009; Lockwood et al. 2011). The MKF macro produces estimates of population means and the error in those estimates (i.e., an estimate of the root mean squared error, RMSE, which is analogous to the standard error of the sample mean). When dealing with a single outcome, Lockwood et al. (2011) derived a Bayesian implementation of the procedure that proved to be robust and provided an accurate assessment of the error in the predicted population means. The MKF macro offers users the choice of using the Bayesian implementation (the default when doing the estimation for a single outcome) or alternative (maximum likelihood based) estimation methods. When dealing with two outcomes, model averaging based on two maximum likelihood estimation assumptions is the default. The model-averaging technique used in this macro can be applied to both the Bayesian and the maximum likelihood approaches (with a single outcome). However, because the Bayesian estimation uses a less stringent time trend assumption, model averaging is implemented only in the maximum likelihood approach, where it addresses the limitations of assuming a specific slope. With small sample sizes, flexibility in the time trend assumption was not warranted with maximum likelihood.
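To make the "optimal weighted average" idea concrete, the following is a minimal sketch of a textbook scalar Kalman filter for one group. It assumes the trend and autoregressive parameters are already known, whereas the macro estimates them. It is not the macro's algorithm; the function name and the parameters `intercept`, `slope`, `rho`, and `tau2` are hypothetical, and the AR(1) form of the deviation is an assumption.

```python
import math


def filter_final_mean(y, se, intercept, slope, rho, tau2):
    """Estimate the population mean at the final time point and its RMSE.

    Assumed model: y_t = intercept + slope * t + u_t + e_t, with
    u_t = rho * u_{t-1} + eps_t, Var(eps_t) = tau2 > 0, Var(e_t) = se_t ** 2,
    |rho| < 1, and time indexed t = 1, 2, ..., T.
    """
    # Start from the stationary distribution of the AR(1) deviation.
    u_hat = 0.0
    p = tau2 / (1.0 - rho ** 2)
    trend = intercept
    for t, (y_t, se_t) in enumerate(zip(y, se), start=1):
        if t > 1:
            # Predict the deviation one step ahead.
            u_hat = rho * u_hat
            p = rho ** 2 * p + tau2
        trend = intercept + slope * t
        # Weight on the new sample mean: large when its SE is small.
        gain = p / (p + se_t ** 2)
        u_hat = u_hat + gain * (y_t - (trend + u_hat))
        p = (1.0 - gain) * p
    return trend + u_hat, math.sqrt(p)


# Hypothetical prevalences (%) and standard errors for one small group.
estimate, rmse = filter_final_mean(
    y=[3.0, 3.2, 3.1, 3.8],
    se=[1.0, 1.0, 1.0, 1.4],
    intercept=3.0, slope=0.1, rho=0.5, tau2=0.04,
)
```

When the final-period standard error is large, the gain is small and the estimate leans on the trend and earlier years; when it is small, the estimate stays close to the final-period sample mean. Under this model the returned RMSE never exceeds the final-period standard error.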
This software macro is designed to be used by analysts for estimation and assumes familiarity with SAS© software. For a better understanding of the methods used in the macro, users are encouraged to read the articles on which the macro is based, including Elliott et al. (2009), Lockwood et al. (2011), and Setodji et al. (2011). Note that the macro is available for use in SAS and is not written for other commonly used statistical software, such as Stata©, SPSS©, or SUDAAN©.
Macro Components and Implementation
The MKF macro software includes all the files for using the software under the Windows© or Linux© (Unix©) operating systems. The main User Guide provides details for using the software with the Windows© operating system.*
This macro requires two files that need to be saved together in a directory or folder chosen by the user. The files are
MKF−MACRO.SAS, the file containing the SAS© macro code that conducts the analysis;
kfwindows.exe, an external executable file accessed by SAS© to conduct statistical computation for Bayesian model estimation for a single outcome.
Users will need to refer to this directory via the software directory (software_dir) macro parameter when implementing the MKF macro in SAS©. The macro creates temporary files in the system TEMP directory that are deleted after the macro is terminated.
Macro Features
The following are some of the basic features of the MKF macro:
works with any type of group mean outcomes;
works with any number of time periods greater than three;
works with one or two outcomes: for one outcome, information across time and groups is pooled; for two outcomes, information across time, outcomes (correlation), and groups is pooled;
allows the user to specify the directory where the macro is stored and group, time, outcome, and standard error variable names;
allows users to choose among multiple subset or subgroup analyses, Bayesian or maximum likelihood estimation methods (or both), different specifications for time trends across groups (group-specific, common, or no time trends), and output specifications;
saves details of the statistical modeling to SAS© data sets that can be manipulated and saved by users.
Data Requirements
The MKF estimation method uses group means and their associated standard errors. The group means are group (e.g., racial/ethnic) averages or prevalence rates that can be estimated from person-level data over time, and the standard errors can also be estimated from person-level data. The macro accepts only user-computed group means and standard errors as input. As a first step, users should therefore estimate these group means and standard errors using SAS or other statistical software, taking into account complex survey designs (e.g., sampling weights) when necessary, before supplying them to the macro. The specific requirements of the macro are as follows:
The data should consist of one record for each of G groups measured at each of T time points for a total of G × T records.
Every record must include a value for a group identifier variable to identify the G groups.
The group identifier can be either a character or numeric variable.
The time period must be numeric and equally spaced. For example, times could be t1 = 1, t2 = 2, etc., or t1 = 1998, t2 = 2000, t3 = 2002, t4 = 2004, etc., where the measurements are two years apart, but times could not be t1 = 1998, t2 = 2000, t3 = 2002, t4 = 2003, where from t1 to t2 or t2 to t3 there is a two-year span but between t3 and t4 there is only a one-year span.
The outcome of interest Ygt (and Xgt when dealing with two outcomes), g = 1, …, G, t = 1, …, T, can take any real value and will typically consist of group means or prevalence rates from samples of a population or subpopulation of interest.
The outcome data must be complete, with no missing value for any group or time period.
The data must contain standard errors (SEYgt ≥ 0) for each group estimated at each time point, with no missing values allowed, for all the outcomes of interest.
An SEYgt = 0 means that zero variance was observed in group g at time t, or that Ygt for each member of the subpopulation g was the same. For each group, SEYgt (and SEXgt if the interest is in two outcomes) must be greater than zero for at least one time period for the macro to produce population mean estimates. No variation within a group at a given time period might occur for rare diseases and small samples in which no cases in the sample have been observed with the disease.
Because missing data are not allowed in the macro, users who need to deal with missing values are encouraged to use missing data imputation techniques to fill in missing values before using the macro.
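The requirements above can be checked before the data are passed to the macro. The following is a minimal sketch of such a pre-check written outside SAS; the function name and the field names `group`, `time`, `mean`, and `se` are hypothetical and are not the macro's parameter names. It assumes time values that subtract exactly (e.g., integer years).

```python
import math
from collections import defaultdict


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_mkf_input(records, group="group", time="time", mean="mean", se="se"):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for i, rec in enumerate(records):
        for key in (group, time, mean, se):
            if _is_missing(rec.get(key)):
                problems.append(f"record {i}: missing value for '{key}'")
    if problems:
        return problems  # the remaining checks need complete records

    times_by_group = defaultdict(list)
    ses_by_group = defaultdict(list)
    for rec in records:
        times_by_group[rec[group]].append(rec[time])
        ses_by_group[rec[group]].append(rec[se])

    # Equal spacing: every gap between consecutive time points is the same.
    all_times = sorted({t for ts in times_by_group.values() for t in ts})
    gaps = {later - earlier for earlier, later in zip(all_times, all_times[1:])}
    if len(gaps) > 1:
        problems.append(f"time points are not equally spaced: {all_times}")

    for g, ts in times_by_group.items():
        # G x T layout: exactly one record per group and time point.
        if sorted(ts) != all_times:
            problems.append(f"group {g!r}: expected one record per time point")
        if any(s < 0 for s in ses_by_group[g]):
            problems.append(f"group {g!r}: negative standard error")
        if not any(s > 0 for s in ses_by_group[g]):
            problems.append(f"group {g!r}: standard error is zero at every time point")
    return problems


# Hypothetical input: two groups measured every two years.
rows = [
    {"group": g, "time": t, "mean": 5.0, "se": 0.5}
    for g in ("A", "B")
    for t in (1998, 2000, 2002, 2004)
]
issues = check_mkf_input(rows)
```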
Notes
The MKF macro will work with either the Windows© or Linux© (Unix©) operating systems without any changes to the SAS© macro code. However, the macro accesses an external executable file to conduct some of the statistical computations and the installation of this file is operating-system-dependent.
References
- Elliott M. N., McCaffrey D. F., Finch B. K., Klein D. J., Orr N., Beckett M. N., and Lurie N. (2009). “Improving Disparity Estimates for Rare Racial/Ethnic Groups with Trend Estimation and Kalman Filtering: An Application to the National Health Interview Survey.” Health Services Research 44 (5): 1622–1639.
- Klein R. J., Proctor S. E., Boudreault M. A., and Turczyn K. M. (2002). Healthy People 2010 Criteria for Data Suppression. Statistical Notes, No. 24. Hyattsville, Maryland: National Center for Health Statistics.
- Kalman R. (1960). “A New Approach to Linear Filtering and Prediction Problems.” Transactions of the ASME Journal of Basic Engineering 82: 35–45.
- Lockwood J. R., McCaffrey D. F., Setodji C. M., and Elliott M. N. (2011). “Smoothing Across Time in Repeated Cross-Sectional Data.” Statistics in Medicine 30 (5): 584–594.
- Miller S. T., Schlundt D. G., Larson C., Reid R., Pichert J. W., Hargreaves M., Brown A., McClellan L., and Marrs M. (2004). “Exploring Ethnic Disparities in Diabetes, Diabetes Care, and Lifestyle Behaviors: the Nashville REACH 2010 Community Baseline Survey.” Ethnicity and Disease 14 (3 Suppl 1): 3845.
- Setodji C. M., Adams J. L., McCaffrey D. F., Elliott M. N., and Roary M. (2011). “Borrowing Strength Across Time and Outcomes in Repeated Cross-Sectional Data.” Manuscript under preparation.