Abstract
Objective
To improve on existing methods to infer race/ethnicity in health care data through an analysis of birth records from Connecticut.
Data Source
A total of 162 467 Connecticut birth records from 2009 to 2013.
Study Design
We developed a logistic model to predict race/ethnicity using data from the US Census and patient‐level information. Model performance was tested and compared to previous studies. Five performance measures were used for comparison.
Principal Findings
Our full model correctly classifies 81 percent of subjects and shows improvement over extant methods. We achieved substantially improved sensitivity in predicting black race.
Conclusions
Predictive models using Census information and patients’ demographic characteristics can be used to accurately populate race/ethnicity information in health care databases, enhancing opportunities to investigate and address disparities in access to, utilization of, and outcomes of care.
Keywords: health disparities, health insurance claims, imputation, missing data
1. INTRODUCTION
All‐payer claims databases (APCDs),1 currently established or in formation in 25 states, offer unprecedented opportunities to analyze access to, utilization of, and outcomes of medical care. The lack of patient‐reported information on patients’ race and ethnicity, however, is a major shortcoming of virtually all APCDs in the United States. APCDs contain race and ethnic information on approximately 3 percent of commercially insured beneficiaries.2 This shortcoming renders APCDs useless for analysis of racial and ethnic disparities in the utilization and outcomes of care. With the substantial body of evidence showing higher rates of chronic disease3, 4, 5, 6 and poorer treatment outcomes7, 8, 9, 10, 11, 12 among patients of color in the United States, the lack of race and ethnic information in medical claims repositories constitutes a significant missed opportunity for monitoring and addressing health disparities.
Over the past two decades, there have been a number of studies attempting to demonstrate the utility of indirect methods of inferring or assigning race and ethnicity to patient health information.13, 14 Geocoding using Census information and ethnic surname dictionaries have been the two major indirect sources of data for inferring race/ethnicity. Elliott et al15 and Elliott et al16 developed Bayesian Surname and Geocoding (BSG) and Bayesian Improved Surname and Geocoding (BISG) methods for estimating race/ethnicity and demonstrated the superiority of these methods compared to two prior approaches using surname information combined with Census block information in estimating overall race/ethnicity prevalence and assigning race and ethnic status to individual patients. The BSG/BISG determines the individuals’ race/ethnicity estimates based on the race/ethnicity distribution in the Census block. It then applies Bayes’ formula to update the probability distribution, provided that the individual's surname appears on surname lists for Asians or Hispanics. The BSG uses surname lists collected from the population itself, while the BISG uses surname lists provided by the Census Bureau. Both BSG and BISG have now been used in several studies with diverse health care data.17, 18, 19, 20
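The updating step can be illustrated with a minimal sketch. It is not the BSG/BISG implementation of Elliott et al; the function name, the block composition, and the surname‐list probabilities below are hypothetical values chosen only to show how Bayes' formula combines a geographic prior with surname evidence.

```python
def bayes_update(prior, likelihood):
    """Posterior P(race | evidence) from a prior P(race) and P(evidence | race)."""
    unnormalized = {race: prior[race] * likelihood[race] for race in prior}
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical race/ethnicity shares in one Census block (the prior).
block_prior = {"white_other": 0.60, "black": 0.20, "hispanic": 0.15, "asian": 0.05}

# Hypothetical probability, within each group, that a person's surname
# appears on a Hispanic surname list (the evidence).
p_on_hispanic_list = {"white_other": 0.02, "black": 0.01, "hispanic": 0.80, "asian": 0.01}

posterior = bayes_update(block_prior, p_on_hispanic_list)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

With these illustrative inputs, a listed surname moves most of the probability mass to the Hispanic category even though Hispanics are a minority of the block.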
In order to mimic the data structure of an APCD while having complete race and ethnic information, we use statewide birth registry records. We use the birth registry to explore two potential opportunities for improving the accuracy of predictive models in assigning race/ethnic characteristics to individuals in datasets where this information is missing or unreliable, using only limited nonmissing information. First, we gathered the distribution of race/ethnicity in Connecticut Census tracts21 and the national race/ethnicity proportions of each surname family from 2010 US Census data.22 Unlike in Elliott et al,15 where each surname is associated with Asian (AS) and Hispanic (HS) indicators for being in the Lauderdale‐Kestenbaum Asian Surname List23 or Census Bureau Spanish Surname List,24 our methodology associates a surname with a vector of probabilities for being in each race category. Second, we include an expanded set of predictors, such as insurance coverage and mother's age, to determine the potential benefits of including ancillary characteristics in the prediction model. We then compare these results to those obtained using existing approaches (eg, BSG/BISG) to inferring race and ethnic characteristics.
2. METHODS
2.1. Measures
Since APCDs are usually missing the vast majority of race identification, and we need complete data to evaluate our methodology, we obtained birth records from the Connecticut Department of Public Health for 2009‐2013 for this analysis. In these data, race is observed for all patients, so we are able to compare our predicted race to the true race. Parents and children appearing in the birth registry were matched to Connecticut Census tracts based on their residential address, resulting in 162 467 observations residing in 827 tracts. A total of 19 286 subjects with invalid or out‐of‐state addresses were removed from the analysis.
For each child, race and ethnicity were defined as the self‐reported race and ethnic status of the mother. This is based on CDC guidelines, the procedure followed in CT, and common procedure in the field.25, 26 Race was coded in birth records as white, black, Indian, Chinese, Japanese, Hawaiian, Filipino, other Asian, and other race categories. Hispanic ethnicity was reported separately. These data were recoded into four categories for our analysis: white non‐Hispanic, black non‐Hispanic, Hispanic, and other. Also, each mother's surname was associated with a vector of percentages for being in each of these four categories from the 2010 Census. For example, “Smith” gets (71, 23, 2, and 4 percent) and “Nguyen” gets (1, 0, 1, and 98 percent) for white non‐Hispanic, black non‐Hispanic, Hispanic, and other, respectively. Other covariates included in the model were mother's age at delivery, insurance type (private insurance, self/no insurance, Medicaid, other insurance), and whether the father was missing on the birth certificate. These covariates represent other available information which can be used in the imputation procedure. Finally, the race and ethnic distribution of Connecticut residents in the state's 828 Census tracts in 2010 was included as percentages and coded consistently with the race and ethnic categories described above.
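As a rough illustration of how these measures could be assembled into one predictor row per birth record, consider the sketch below. The function name, the dictionary layout, the insurance labels, and the example tract are assumptions made for illustration; only the two surname vectors come from the text. White percentages are left out of the row because Table 2 lists only the black, Hispanic, and other percentages as predictors, and private insurance serves as the reference level.

```python
# Surname vectors quoted in the text: percent (white NH, black NH, Hispanic, other).
SURNAME_PCT = {
    "SMITH": (71, 23, 2, 4),
    "NGUYEN": (1, 0, 1, 98),
}

# Hypothetical labels; private insurance is the omitted reference level.
INSURANCE_LEVELS = ("medicaid", "self_or_none", "other")


def encode_record(surname, tract_pct, insurance, mother_age, father_missing):
    """Return a flat predictor row for one birth record (hypothetical layout)."""
    _, s_black, s_hisp, s_other = SURNAME_PCT[surname.upper()]
    _, t_black, t_hisp, t_other = tract_pct
    insurance_dummies = [1 if insurance == level else 0 for level in INSURANCE_LEVELS]
    return [t_black, t_hisp, t_other,
            *insurance_dummies,
            mother_age,
            1 if father_missing else 0,
            s_black, s_hisp, s_other]


# Hypothetical tract: 70% white, 10% black, 15% Hispanic, 5% other.
row = encode_record("Nguyen", tract_pct=(70, 10, 15, 5),
                    insurance="medicaid", mother_age=30, father_missing=False)
print(row)
```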
2.2. Models
We fitted two multinomial logistic regression models: a CT‐based full (CTBF) model and a CT‐based reduced (CTBR) model. The full model used all available information, including the 2010 Census tract race/ethnicity percentages, surname race percentages, insurance types, a dummy variable for father missing, and mothers’ age, while the reduced model did not include the last three demographic variables. The Census and surname distributions are presented using percentages for ease of interpretation. The dataset was partitioned into equally sized training and testing datasets. The models were fitted using 5 percent of the training data, which mimics the reality that race/ethnicity is known for only 3‐5 percent of claims in APCDs. After obtaining parameter estimates for all the measures in the model based on the training data, we predicted race/ethnicity for individuals in the testing data, compared the predicted race with the true race, and evaluated the results using multiple measures.
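The partitioning described above can be sketched as follows. The random seed, the variable names, and the choice to place the odd record in the testing half are assumptions; the resulting sizes agree with the n reported for Table 2 (4062) and Table 3 (81 234).

```python
import random

N_RECORDS = 162_467          # geocoded birth records reported in the text
rng = random.Random(2019)    # arbitrary seed, an assumption for reproducibility

indices = list(range(N_RECORDS))
rng.shuffle(indices)

half = N_RECORDS // 2
train_idx = indices[:half]   # training half
test_idx = indices[half:]    # testing half (receives the odd record)

# Keep 5 percent of the training half to mimic the share of known race/ethnicity.
n_fit = round(0.05 * len(train_idx))
fit_idx = rng.sample(train_idx, n_fit)

print(len(train_idx), len(test_idx), len(fit_idx))  # 81233 81234 4062
```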
2.3. Performance measures
Two performance metrics in Elliott et al15 were used to assess model performance. The first measures the accuracy of the predicted race/ethnicity distribution in the testing data using the weighted average of the error of the four racial prevalence estimates. Specifically, denoting the true prevalence in the testing dataset as $(p_w, p_b, p_h, p_o)$ for whites, blacks, Hispanics, and others, respectively, and the predicted prevalence for the testing dataset as $(q_w, q_b, q_h, q_o)$, we calculate:

$$\text{Weighted error} = p_w\,|p_w - q_w| + p_b\,|p_b - q_b| + p_h\,|p_h - q_h| + p_o\,|p_o - q_o|$$
The second measure evaluates the accuracy of individuals’ predicted race/ethnicity using the weighted correlation of the individuals’ true race/ethnicity and their predicted race/ethnicity. The weights were chosen to be the true prevalence of race/ethnicity in the testing dataset. For example, for all observations in the testing dataset, we have a vector $r_{wt}$ of 0's and 1's indicating if individuals were white (0 = No, 1 = Yes). After obtaining predicted race/ethnicity, we have a second vector $r_{wp}$ indicating if individuals were predicted to be white. Similarly, we generated $r_{bt}$, $r_{bp}$, $r_{ht}$, $r_{hp}$, $r_{ot}$, and $r_{op}$ for blacks, Hispanics, and others. Then, we denote the correlation coefficient between $r_{wt}$ and $r_{wp}$ as $\mathrm{corr}_w$ (and likewise $\mathrm{corr}_b$, $\mathrm{corr}_h$, and $\mathrm{corr}_o$), and, with the true prevalences of whites, blacks, Hispanics, and others as weights $p_w$, $p_b$, $p_h$, and $p_o$, calculate:

$$\text{Weighted correlation} = p_w\,\mathrm{corr}_w + p_b\,\mathrm{corr}_b + p_h\,\mathrm{corr}_h + p_o\,\mathrm{corr}_o$$
We also supplemented the measures described above with four common measures of accuracy: sensitivity and specificity,27 Cohen's κ,28 and percentage of correct predictions.
We also compared the performance of our models to results obtained from applying the BSG/BISG15, 16 model. Since we did not have access to their data, the comparison was based on our data. The formulas of the BSG were used, but with an updated surname list from the 2010 Census instead of the 2000 Census; we are therefore essentially comparing to a combination of BSG and BISG. We adapted the encoding of the CT birth records data to the race/ethnicity categories in Elliott et al:15 white/other, black, Hispanic, and Asian. Using the 2010 Census surname list, we created two indicators for the testing dataset, AS and HS, flagging surname families in which Asians or Hispanics, respectively, had the highest percentage. We then obtained the BSG posterior probabilities and predicted races/ethnicities. Next, we again partitioned the data into equally sized training and testing datasets, and fitted our CTBF and CTBR models on 5 percent of the training data. Using the fitted parameters, we obtained race/ethnicity predictions for observations in the testing dataset. We applied the same performance measures to the outcomes.
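One plausible reading of the AS and HS indicators is sketched below: a surname is flagged when Asian (or Hispanic) is its single largest category. The function name, the category labels, and the example percentages are hypothetical and are not taken from the Census surname file.

```python
def surname_indicators(pct):
    """Return (AS, HS) flags for one surname.

    `pct` maps hypothetical category labels to the surname's percentages.
    AS is 1 when Asian is the largest category, HS is 1 when Hispanic is.
    """
    top = max(pct, key=pct.get)
    return (1 if top == "asian" else 0, 1 if top == "hispanic" else 0)


# Illustrative percentages, not values taken from the Census file.
examples = {
    "surname_a": {"white_other": 5, "black": 1, "hispanic": 92, "asian": 2},
    "surname_b": {"white_other": 3, "black": 0, "hispanic": 1, "asian": 96},
    "surname_c": {"white_other": 73, "black": 23, "hispanic": 3, "asian": 1},
}
flags = {name: surname_indicators(p) for name, p in examples.items()}
print(flags)
```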
3. RESULTS
Table 1 summarizes the distribution of race/ethnicity and other demographic variables in CT birth records from 2009 to 2013. Whites comprised the majority of the sample (57 percent), followed by Hispanics (22 percent), blacks (13 percent), and other race (8 percent). Sixty percent of the sample was privately insured, with 35 percent covered by Medicaid. The father's name was present on 89 percent of birth certificates, and mothers’ mean age at delivery was 29.9 years (SD = 6.1).
Table 1.
Selected demographic characteristics from Connecticut birth records from 2009 to 2013 (N = 162 467)
| | Percent | Number |
|---|---|---|
| White | 57 | 92 483 |
| Black | 13 | 20 738 |
| Hispanic | 22 | 36 350 |
| Other | 8 | 12 896 |
| Total | 100 | 162 467 |
| Private | 60 | 97 154 |
| Medicaid | 35 | 56 603 |
| None/Self | 4 | 6342 |
| Other insurance | 1 | 2368 |
| Total | 100 | 162 467 |
| Father missing | 11 | 18 129 |
| Father not missing | 89 | 144 338 |
| Total | 100 | 162 467 |
In Table 2, we present the regression parameter estimates, their 95 percent confidence intervals, and p‐values for the CTBR and CTBF models based on 5 percent of the training data. Whites are used as the reference category, and the parameters describe the change in the log odds of being black, Hispanic, or other relative to being white associated with each predictor variable. Results show that living in a Census tract with a higher percentage of non‐whites was significantly associated with being black or Hispanic. Missing father's information in the birth record was also significantly associated with black race. Compared to the privately insured, Medicaid beneficiaries were significantly more likely to be black or Hispanic, while having no insurance or having other unspecified insurance was associated with Hispanic ethnicity. Although mother's age was not informative for blacks, babies with younger mothers were more likely to be Hispanic or other race. Finally, names associated with particular race and ethnic groups were significant predictors of an individual's race/ethnicity: surnames more likely to be associated with blacks were strongly predictive of black race; Hispanic surnames were predictive of Hispanic ethnicity; and names associated with other race categories were predictive of other race.
Table 2.
Parameter estimates, 95% confidence intervals of parameter estimates, and p‐values for CTBR and CTBF, fitted using 5% of the training data (n = 4062)
| Race | Variable | CTBR β | CTBR 95% CI | CTBR Pr(>\|Z\|) | CTBF β | CTBF 95% CI | CTBF Pr(>\|Z\|) |
|---|---|---|---|---|---|---|---|
| Black | (Intercept) | −4.846 | [−5.226, −4.465] | <0.001 | −5.048 | [−5.897, −4.199] | <0.001 |
| | Percent of blacks in tract | 0.068 | [0.059, 0.077] | <0.001 | 0.064 | [0.054, 0.074] | <0.001 |
| | Percent of Hispanics in tract | 0.045 | [0.037, 0.053] | <0.001 | 0.036 | [0.028, 0.044] | <0.001 |
| | Percent of others in tract | 0.078 | [0.050, 0.106] | <0.001 | 0.080 | [0.051, 0.109] | <0.001 |
| | Medicaid insurance | | | | 0.925 | [0.625, 1.225] | <0.001 |
| | Self‐pay, no insurance | | | | 0.446 | [−0.244, 1.136] | 0.205 |
| | Other insurance | | | | 0.733 | [−0.343, 1.813] | 0.181 |
| | Mother age | | | | 0.001 | [−0.025, 0.023] | 0.908 |
| | Father missing | | | | 0.844 | [0.481, 1.207] | <0.001 |
| | Percent of blacks in surname | 0.064 | [0.056, 0.072] | <0.001 | 0.061 | [0.053, 0.069] | <0.001 |
| | Percent of Hispanics in surname | −0.008 | [−0.016, 0.001] | 0.106 | −0.009 | [−0.017, −0.001] | 0.055 |
| | Percent of others in surname | 0.009 | [−0.004, 0.022] | 0.157 | 0.010 | [−0.004, 0.024] | 0.127 |
| Hispanic | (Intercept) | −3.606 | [−3.900, −3.312] | <0.001 | −3.285 | [−4.006, −2.564] | <0.001 |
| | Percent of blacks in tract | 0.034 | [0.024, 0.043] | <0.001 | 0.030 | [0.020, 0.040] | <0.001 |
| | Percent of Hispanics in tract | 0.050 | [0.043, 0.057] | <0.001 | 0.038 | [0.030, 0.046] | <0.001 |
| | Percent of others in tract | 0.037 | [0.015, 0.069] | 0.002 | 0.044 | [0.017, 0.071] | 0.002 |
| | Medicaid insurance | | | | 1.098 | [0.829, 1.367] | <0.001 |
| | Self‐pay, no insurance | | | | 1.525 | [1.043, 2.007] | <0.001 |
| | Other insurance | | | | 2.395 | [1.729, 3.061] | <0.001 |
| | Mother age | | | | −0.022 | [−0.042, −0.002] | 0.033 |
| | Father missing | | | | 0.260 | [−0.097, 0.617] | 0.152 |
| | Percent of blacks in surname | 0.002 | [−0.009, 0.013] | 0.666 | 0.002 | [−0.010, 0.014] | 0.746 |
| | Percent of Hispanics in surname | 0.039 | [0.036, 0.042] | <0.001 | 0.038 | [0.034, 0.042] | <0.001 |
| | Percent of others in surname | −0.008 | [−0.025, 0.009] | 0.364 | −0.007 | [−0.025, 0.011] | 0.449 |
| Other | (Intercept) | −4.076 | [−4.431, −3.720] | <0.001 | −3.155 | [−4.168, −2.142] | <0.001 |
| | Percent of blacks in tract | 0.023 | [0.008, 0.038] | 0.002 | 0.023 | [0.007, 0.039] | 0.004 |
| | Percent of Hispanics in tract | 0.016 | [0.003, 0.029] | 0.013 | 0.003 | [−0.002, 0.026] | 0.070 |
| | Percent of others in tract | 0.105 | [0.078, 0.132] | <0.001 | 0.105 | [0.078, 0.132] | <0.001 |
| | Medicaid insurance | | | | 0.168 | [−0.220, 0.556] | 0.396 |
| | Self‐pay, no insurance | | | | −0.039 | [−0.954, 0.876] | 0.934 |
| | Other insurance | | | | 0.474 | [−0.825, 1.773] | 0.475 |
| | Mother age | | | | −0.030 | [−0.059, −0.001] | 0.044 |
| | Father missing | | | | 0.328 | [−0.229, 0.885] | 0.248 |
| | Percent of blacks in surname | 0.005 | [−0.009, 0.019] | 0.475 | 0.003 | [−0.011, 0.017] | 0.688 |
| | Percent of Hispanics in surname | 0.006 | [−0.005, 0.011] | 0.467 | 0.002 | [−0.006, 0.010] | 0.590 |
| | Percent of others in surname | 0.056 | [0.049, 0.062] | <0.001 | 0.056 | [0.050, 0.062] | <0.001 |
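To show how the estimates in Table 2 translate into predicted probabilities, the sketch below plugs the reported CTBR coefficients into the standard multinomial logit formula with white as the reference category. The tract composition is hypothetical, the function name is illustrative, and assigning the category with the highest probability is an assumption about how a single predicted race would be chosen.

```python
import math

# CTBR coefficients from Table 2 (white is the reference category).
# Order: intercept, tract % black, tract % Hispanic, tract % other,
#        surname % black, surname % Hispanic, surname % other.
CTBR = {
    "black":    (-4.846, 0.068, 0.045, 0.078, 0.064, -0.008, 0.009),
    "hispanic": (-3.606, 0.034, 0.050, 0.037, 0.002, 0.039, -0.008),
    "other":    (-4.076, 0.023, 0.016, 0.105, 0.005, 0.006, 0.056),
}


def predict_proba(tract, surname):
    """Class probabilities from the multinomial logit with white as baseline."""
    x = (1.0, *tract, *surname)
    scores = {"white": 0.0}  # linear predictor of the reference class is zero
    for race, beta in CTBR.items():
        scores[race] = sum(b * v for b, v in zip(beta, x))
    denom = sum(math.exp(s) for s in scores.values())
    return {race: math.exp(s) / denom for race, s in scores.items()}


# Hypothetical tract: 10% black, 10% Hispanic, 5% other.
tract = (10, 10, 5)
# Surname vectors from the text, without the white percentage.
smith = predict_proba(tract, surname=(23, 2, 4))
nguyen = predict_proba(tract, surname=(0, 1, 98))

for name, probs in (("Smith", smith), ("Nguyen", nguyen)):
    best = max(probs, key=probs.get)
    print(name, best, {k: round(v, 3) for k, v in probs.items()})
```

In this hypothetical tract, "Smith" is assigned to white and "Nguyen" to other, which reflects how strongly the surname percentages drive the prediction.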
To evaluate our models, we first compared the results of the imputation model with observed race and ethnicity in the testing data. We present comparisons between self‐reported race/ethnicity and imputed race/ethnicity for each race and ethnic category based on the full and reduced models. In addition, we present the sensitivities and specificities of our models, Cohen's kappa, and the weighted correlation and deviation measures as defined above. We also present the percentage of observations in the testing data that were correctly predicted. The model performance measures are summarized in Table 3. Both models overestimated the number of whites, while underestimating the other three categories, as reflected in the high sensitivity for whites and the low sensitivities for blacks, Hispanics, and others. By comparing full and reduced models, we see that including the demographic variables helped identify more blacks and Hispanics, as the sensitivity of CTBF for blacks and Hispanics was higher than that for CTBR. Addition of the demographic variables did not provide much assistance in predicting other race.
Table 3.
Comparing imputed race and ethnicity to self‐reported race and ethnicity for testing subset of CT birth records (n = 81 234)
| | Self‐reported white | Self‐reported black | Self‐reported Hispanic | Self‐reported other | Total |
|---|---|---|---|---|---|
| CT‐based reduced model (a) | | | | | |
| Imputed, n | | | | | |
| White | **42 461** | 3616 | 4483 | 2776 | 53 336 |
| Black | 1259 | **6236** | 805 | 371 | 8671 |
| Hispanic | 2168 | 373 | **12 944** | 259 | 15 744 |
| Other | 324 | 49 | 65 | **3045** | 3483 |
| Total | 46 212 | 10 274 | 18 297 | 6451 | 81 234 |
| Sensitivity, % | 92 [91.6, 92.1] | 61 [59.8, 61.6] | 71 [70.1, 71.4] | 47 [46.0, 48.4] | |
| Specificity, % | 69 [68.5, 69.4] | 97 [96.4, 96.7] | 96 [95.4, 95.7] | 99 [99.4, 99.5] | |
| Cohen's kappa, estimate [lower, upper] | 0.64 [0.64, 0.64] | | | | |
| Weighted error | 0.062 | | | | |
| Correlation | 0.646 | | | | |
| Correct rate, % | 79.6 | | | | |
| CT‐based full model (b) | | | | | |
| Imputed, n | | | | | |
| White | **42 644** | 3384 | 4221 | 2788 | 53 037 |
| Black | 1196 | **6453** | 780 | 385 | 8814 |
| Hispanic | 2041 | 383 | **13 229** | 233 | 15 886 |
| Other | 331 | 54 | 67 | **3045** | 3497 |
| Total | 46 212 | 10 274 | 18 297 | 6451 | 81 234 |
| Sensitivity, % | 92 [92.0, 92.5] | 63 [61.9, 63.7] | 72 [71.7, 72.9] | 47 [46.0, 48.4] | |
| Specificity, % | 70 [69.8, 70.8] | 97 [96.5, 96.8] | 96 [95.6, 95.9] | 99 [99.3, 99.5] | |
| Cohen's kappa, estimate [lower, upper] | 0.66 [0.65, 0.66] | | | | |
| Weighted error | 0.060 | | | | |
| Correlation | 0.662 | | | | |
| Correct rate, % | 80.5 | | | | |
The reduced model (a) included as predictors only the distribution of race/ethnicity in tracts and in surnames; the full model (b) used additional demographic characteristics, which included mother's age, dummy variable for father missing, and insurance payment types. Bold values in the tables are the number of correctly predicted observations in each race/ethnicity group.
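The summary measures reported for the reduced model can be recomputed from the cross‐tabulated counts in Table 3. The sketch below is illustrative rather than the authors' code: it assumes the weighted error is the prevalence‐weighted absolute difference between true and predicted prevalence, and that each class correlation is the Pearson correlation between the 0/1 true and 0/1 predicted indicators. Under those assumptions it reproduces the published point estimates.

```python
import math

LABELS = ("white", "black", "hispanic", "other")

# Table 3, reduced model: rows are imputed categories, columns are self-reported.
COUNTS = [
    [42461, 3616, 4483, 2776],
    [1259, 6236, 805, 371],
    [2168, 373, 12944, 259],
    [324, 49, 65, 3045],
]

n = sum(map(sum, COUNTS))
true_tot = [sum(row[j] for row in COUNTS) for j in range(4)]   # column totals
pred_tot = [sum(row) for row in COUNTS]                        # row totals
p = [t / n for t in true_tot]    # true prevalence
q = [t / n for t in pred_tot]    # predicted prevalence

weighted_error = sum(pk * abs(pk - qk) for pk, qk in zip(p, q))


def phi(k):
    """Correlation between the 0/1 true and 0/1 predicted indicators of class k."""
    tp = COUNTS[k][k]
    num = n * tp - pred_tot[k] * true_tot[k]
    den = math.sqrt(pred_tot[k] * (n - pred_tot[k]) * true_tot[k] * (n - true_tot[k]))
    return num / den


weighted_corr = sum(p[k] * phi(k) for k in range(4))

sensitivity = [COUNTS[k][k] / true_tot[k] for k in range(4)]
specificity = [(n - pred_tot[k] - true_tot[k] + COUNTS[k][k]) / (n - true_tot[k])
               for k in range(4)]

correct_rate = sum(COUNTS[k][k] for k in range(4)) / n
expected_agreement = sum(pk * qk for pk, qk in zip(p, q))
kappa = (correct_rate - expected_agreement) / (1 - expected_agreement)

print(round(weighted_error, 3), round(weighted_corr, 3),
      round(kappa, 2), round(100 * correct_rate, 1))
# prints: 0.062 0.646 0.64 79.6
```

Swapping in the full‐model counts from the same table gives the corresponding CTBF values.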
The results presented in Table 4 indicate that the CT‐based methods had better performance compared to the combined BSG/BISG method in terms of almost every measure, with the only exception being <2 percentage point difference in sensitivity for white/other and <1 percentage point difference in specificity for black and Hispanic. In particular, the correlation between predicted and actual race/ethnicity based on our full model is 0.668, compared to 0.597 using the BSG/BISG. It is important to note that the main improvement associated with our model was derived from greatly improved prediction of black race, as reflected in the much lower sensitivity of the BSG/BISG relative to the CTBF (39 vs 63 percent).
Table 4.
Comparison of performance for CTBR, CTBF, and the BSG, using race encodings of Elliott et al 2008
| | CTBR | CTBF | BSG/BISG |
|---|---|---|---|
| Weighted error | 0.051 | 0.048 | 0.089 |
| Correlation | 0.595 | 0.668 | 0.597 |
| Sensitivity (%) | | | |
| White/Other | 91 [90.7, 91.3] | 91 [90.7, 91.3] | 93 [92.8, 93.2] |
| Black | 60 [59.1, 60.9] | 63 [62.1, 63.9] | 39 [38.1, 39.9] |
| Hispanic | 71 [70.3, 71.6] | 72 [71.4, 72.7] | 66 [65.3, 66.7] |
| Asian | 59 [57.8, 60.2] | 59 [57.8, 60.2] | 57 [55.8, 58.2] |
| Specificity (%) | | | |
| White/Other | 70 [69.5, 70.5] | 72 [71.5, 72.5] | 61 [60.5, 61.5] |
| Black | 97 [96.9, 97.1] | 96 [95.9, 96.1] | 97 [96.9, 97.1] |
| Hispanic | 96 [95.8, 96.2] | 96 [95.8, 96.2] | 97 [96.9, 97.1] |
| Asian | 99 [98.9, 99.1] | 99 [98.9, 99.1] | 99 [98.9, 99.1] |
| Cohen's kappa | 0.65 | 0.67 | 0.58 |
| % of predictions correct | 81 | 81 | 78 |
4. CONCLUSIONS
In this study, we present an improved method for imputing race and ethnicity in health care data. Our reduced model, which was similar to the state‐of‐the‐science approaches put forth by Elliott and colleagues,15, 16 showed substantially improved sensitivity in predicting black race and, to a lesser extent, Hispanic ethnicity by using a full probability vector for the race and ethnic surname dictionary as opposed to binary coding for Hispanics and others (Asians, specifically). Our full model, which incorporated a handful of demographic characteristics representative of those typically available in health care data, showed further improvement in sensitivity in predicting black race and in specificity in predicting white race. Both models were easy to implement and interpret, required observed race for only 5 percent of the training records, and quantified the influence of each predictor. While not every demographic characteristic can be found in other claims databases where the models might apply, our study provides a general modeling strategy of incorporating ancillary information to increase prediction accuracy.
Our study had several limitations. First, it utilized the distribution of race among surnames on the national level using Census information, which could differ from that observed in Connecticut. This, however, could be addressed using a Connecticut‐specific surname distribution. Second, the prediction of those in other race categories is not as accurate as that for whites, blacks, and Hispanics, which highlights the need for additional ancillary information, such as voter registration and registry of motor vehicles information, that could improve the identification of less common racial and ethnic minority groups. With more observations in the training data, we might also be able to improve model performance by using block level distributions of race and ethnicity instead of tract level, as Census blocks (N = 67 578 in Connecticut) provide a more fine‐grained characterization of race and ethnicity in a geographic area than tracts (N = 829 in Connecticut).
Despite these limitations, this study has shown that such models can be particularly useful in predicting race and ethnic characteristics in a sample when only a small portion of observations have reliable information on these characteristics, which is currently the case in virtually all statewide medical claims repositories in the United States. Future research should focus on accessing and integrating additional information to further increase the predictive power of the model.
ACKNOWLEDGMENTS
Joint Acknowledgment/Disclosure Statement: The authors would like to thank Marc Elliott for his insightful comments.
None of the authors have a proprietary interest. Funding for this research was provided through a grant from the Centers for Medicare and Medicaid Services to the state of Connecticut (1G1CMS331630‐02‐00). Data for this study were obtained from the Connecticut Department of Public Health. The authors assume full responsibility for all analyses, interpretations and conclusions. The Connecticut Department of Public Health does not endorse or assume any responsibility for any analyses, interpretations or conclusions based on the data. The authors have no other disclosures.
Xue Y, Harel O, Aseltine RH. Imputing race and ethnic information in administrative health data. Health Serv Res. 2019;54:957‐963. 10.1111/1475-6773.13171
REFERENCES
- 1. APCD Council. https://www.apcdcouncil.org/state/map. Accessed November 29, 2017.
- 2. Center for Health Information and Analysis . All‐Payer Claims Database (MA APCD) Release 3.0 Documentation Guide. http://www.chiamass.gov/assets/Uploads/apcd-3-0/release-3-0-Member-Eligibility-File-Documentation-Guide.pdf. 2015. Accessed November 29, 2017.
- 3. Liao Y, Bang D, Cosgrove S, et al. Surveillance of health status in minority communities – Racial and Ethnic Approaches to Community Health Across the U.S. (REACH U.S.) Risk Factor Survey, United States, 2009. MMWR Surveill Summ. 2011;60(6):1‐44.
- 4. Harris R, Nelson LA, Muller C, Buchwald D. Stroke in American Indians and Alaska natives: a systematic review. Am J Public Health. 2015;105(8):16‐26.
- 5. CMS OMH and NORC . Racial and Ethnic Disparities in Diabetes Prevalence, Self‐Management, and Health Outcomes among Medicare Beneficiaries. CMS OMH Data Highlight No. 6. Baltimore, MD. 2010.
- 6. Graham G. Disparities in cardiovascular disease risk in the United States. Curr. Cardiol. Rev. 2015;11(3):238‐245.
- 7. Jiang HJ, Andrews R, Stryer D, Friedman B. Racial/ethnic disparities in potentially preventable readmissions: the case of diabetes. Am J Public Health. 2005;95:1561‐1567.
- 8. Kim H, Ross JS, Melkus GD, Zhao Z, Boockvar K. Unscheduled hospital readmissions among patients with diabetes. Am J Manag Care. 2010;16(10):760‐767.
- 9. Joynt KE, Orav EJ, Jha AK. Thirty‐day readmission rates for Medicare beneficiaries by race and site of care. J Am Med Assoc. 2011;305:675‐681.
- 10. McBean AM, Li S, Gilbertson DT, Collins AJ. Differences in diabetes prevalence, incidence, and mortality among the elderly of four racial/ethnic groups: Whites, blacks, Hispanics, and Asians. Diabetes Care. 2004;27(10):2317‐2324.
- 11. McHugh MD, Carthon JM, Kang XL. Medicare readmissions policies and racial and ethnic health disparities: a cautionary tale. Policy Polit Nurs Pract. 2010;11:309‐316.
- 12. Alexander M, Grumbach K, Remy L, et al. Congestive heart failure hospitalizations and survival in California: patterns according to race/ethnicity. Am Heart J. 1999;137:919‐927.
- 13. National Research Council (US) Panel on DHHS Collection of Race and Ethnic Data, Ver PM, Perrin E, eds. Eliminating Health Disparities: Measurement and Data Needs. Washington, DC: National Academies Press (US); 2004. Appendix G, Racial and Ethnic Data Collection by Health Plans. https://www.ncbi.nlm.nih.gov/books/NBK215750/.
- 14. Fiscella K, Fremont A. Use of geocoding and surname analysis to estimate race and ethnicity. Health Serv Res. 2006;41(4p1):1482‐1500.
- 15. Elliott MN, Morrison PA, Fremont A, Pantoja P, Lurie N. A new method for estimating race/ethnicity and associated disparities where administrative records lack self‐reported race/ethnicity. Health Serv Res. 2008;43(5p1):1722‐1736.
- 16. Elliott MN, Morrison PA, Fremont A, McCaffrey DF, Pantoja P, Lurie N. Using the census bureau's surname list to improve estimates of race/ethnicity and associated disparities. Health Serv Outcomes Res Method. 2009;9(2):69‐83.
- 17. Aci M, Inan C, Avci M. A hybrid classification method of K nearest neighbor, Bayesian methods and genetic algorithm. Expert Syst Appl. 2010;37(7):5061‐5067.
- 18. Adjaye‐Gbewonyo D, Bednarczyk RA, Davis RL, Omer SB. Using the Bayesian improved surname geocoding method (bisg) to create a working classification of race and ethnicity in a diverse managed care population: a validation study. Health Serv Res. 2014;49(1):268‐283.
- 19. Storey P, Murchison AP, Dai Y, et al. Comparing methodologies for imputing ethnicity in an urban ophthalmology clinic. Ophthalmic Epidemiol. 2014;21(2):106‐110.
- 20. Fremont A, Weissman JS, Hoch E, Elliott MN. When race/ethnicity data are lacking: using advanced indirect estimation methods to measure disparities. Rand Health Quarterly. 2016;6(1). https://www.rand.org/pubs/research_reports/RR1162.html.
- 21. Connecticut State Data Center . 2010 Census redistricting data and shapefiles (Public Law 94‐171). https://ctsdc.uconn.edu/connecticut_census_data/#2010_redistricting
- 22. Comenetz J. Frequently occurring surnames in the 2010 Census. 2016. https://www2.census.gov/topics/genealogy/2010surnames/surnames.pdf. Data available from: https://raw.githubusercontent.com/cfpb/proxy-methodology/master/input_files/Names_2010Census.csv
- 23. Lauderdale D, Kestenbaum BB. Asian American ethnic identification by surname. Popul Dev Rev. 2000;19(3):283‐300.
- 24. Perkins RC. Evaluating the passel‐word Spanish surname list: 1990 decennial census post enumeration survey results. U.S. Census Bureau, Population Division. 1990.
- 25. Mason LR, Nam Y, Kim Y. Validity of infant race/ethnicity from birth certificates in the context of U.S. demographic change. Health Serv Res. 2013;49(1):249‐267.
- 26. NCHS (National Center for Health Statistics) National Health Interview Survey 2002. 2005. http://www.cdc.gov/nchs/nhis.htm. Accessed November 1, 2018.
- 27. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Section 2.1. New York, NY: Oxford University Press; 2003.
- 28. Cohen JA. A coefficient of agreement for nominal scales. Educ Psychol Measur. 1960;20:37‐46.