Health Services Research. 2023 Mar 28;58(5):1119–1130. doi: 10.1111/1475-6773.14154

Disaggregating Latino nativity in equity research using electronic health records

Miguel Marino 1,2, Katie Fankhauser 1,3, Jessica Minnier 2, Jennifer A Lucas 1, Sophia Giebultowicz 4, Jorge Kaufmann 1, Jun Hwang 1, Steffani R Bailey 1, Danielle M Crookes 5, Andrew Bazemore 6, Shakira F Suglia 7, John Heintzman 1,4
PMCID: PMC10480087  PMID: 36978286

Abstract

Objective

To develop and validate prediction models for inference of Latino nativity to advance health equity research.

Data Sources/Study Setting

This study used electronic health records (EHRs) from 19,985 Latino children with self‐reported country of birth seeking care from January 1, 2012 to December 31, 2018 at 456 community health centers (CHCs) across 15 states along with census‐tract geocoded neighborhood composition and surname data.

Study Design

We constructed and evaluated the performance of prediction models within a broad machine learning framework (Super Learner) for the estimation of Latino nativity. Outcomes included binary indicators denoting nativity (US vs. foreign‐born) and Latino country of birth (Mexican, Cuban, Guatemalan). The performance of these models was compared using the area under the receiver operating characteristic curve (AUC) from an externally withheld patient sample.

Data Collection/Extraction Methods

Census surname lists, census neighborhood composition, and Forebears administrative data were linked to EHR data.

Principal Findings

Of the 19,985 Latino patients, 10.7% reported a non‐US country of birth (5.1% Mexican, 4.7% Guatemalan, 0.8% Cuban). Overall, prediction models for nativity showed outstanding performance with external validation (US‐born vs. foreign: AUC = 0.90; Mexican vs. non‐Mexican: AUC = 0.89; Guatemalan vs. non‐Guatemalan: AUC = 0.95; Cuban vs. non‐Cuban: AUC = 0.99).

Conclusions

Among challenges facing health equity researchers in health services is the absence of methods for data disaggregation, and the specific ability to determine Latino country of birth (nativity) to inform disparities. Recent interest in more robust health equity research has called attention to the importance of data disaggregation. In a multistate network of CHCs using multilevel inputs from EHR data linked to surname and community data, we developed and validated novel prediction models for the use of available EHR data to infer Latino nativity for health disparities research in primary care and health services research, which is a significant potential methodologic advance in studying this population.

Keywords: ethnicity, health disparities, machine learning, surname data, U.S. Census location


What is known on this topic

  • Broad ethnic categories whose capture is required by federal agencies can mask variation within Latino populations, limiting the ability to target resources/interventions where they are needed most.

  • In health services research, a specific country of birth is considered to be an important determinant of health outcomes and possibly of healthcare utilization for Latinos living in the US.

  • Improving how we collect, investigate, and utilize ethnicity‐disaggregated data is central to health equity research, but there are numerous challenges to doing this safely, ethically, and effectively.

What this study adds

  • Using multilevel data from ~20,000 Latino children with known country of birth, we developed and validated (with good accuracy) novel prediction models to infer Latino nativity.

  • The best‐performing algorithm incorporated the broadest amount of information, including patient‐level EHR data in combination with surname and community‐level subgroup information.

  • Although prediction algorithms are imperfect and cannot replace self‐reported nativity, this approach may allow researchers to understand how more widespread collection of this data from patients may be most useful.

1. INTRODUCTION

Researchers, health systems, and policy makers have highlighted the importance of racial and ethnic health equity research 1 while acknowledging the limitations of currently used broad racial/ethnic categories (e.g., Latino or non‐Latino) when investigating and attempting to ameliorate health disparities. 2 , 3 The broad racial/ethnic categories whose capture is currently required by the Office of Management and Budget for all federally collected data 4 can mask significant variation within Latino categories, limiting the ability to target resources and interventions where they are needed most. 5 The need for more robust health equity research has led multiple organizations (e.g., NIH, Congress via the Affordable Care Act, etc.) to call for increased data disaggregation (i.e., breaking down data by more granular key characteristics in order to understand the most salient characteristics to health) in clinical research. 5 , 6 , 7 Improving how we collect, investigate, and respond to disaggregated data on ethnicity is central to the pursuit of health equity.

This may be especially applicable to Latino populations. In health services research, specific country of birth is considered to be an important determinant of health outcomes and possibly of healthcare utilization for Latinos living in the US. 8 , 9 , 10 , 11 As of 2020, about 19.8 million (32%) of the 62.1 million Latinos in the United States were not born in the US 12 ; this group represents multiple nations of origin. However, determining the country of birth for a population of patients, in a manner safe for patients and appropriately linked to health and healthcare data over time, has proven to be difficult. 13 This may be because datasets seldom contain country of birth, relevant demographics (language preference, socioeconomic status), and longitudinal clinical data together in order to most completely study the impact of country of birth on healthcare use. In asthma research, differences by country of birth or background are oft‐discussed, but are studied most often in cross‐sectional analyses, 14 , 15 , 16 , 17 and frequently utilized administrative healthcare data sources (e.g., claims) simply do not have this information. The prevalence of asthma varies across Latinos with differing birthplaces (US‐born vs. non‐US‐born; varying countries of birth) and may be confounded by the duration of residence in the United States and other socioeconomic factors. 8 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 The literature contains mixed findings on whether there are widespread, adjusted differences in the utilization of healthcare services for asthma, or in general, across varying Latino birthplace groups. Therefore, health services researchers, clinicians, and policy makers who care for Latino children with asthma, or Latino patients in general, are not equipped with adequate birth country information in real‐world clinical datasets in order to best achieve health equity.

Widespread survey collection might be the more ideal approach to increasing disaggregated data availability, because of the importance of self‐reporting of one's racial and ethnic identity. Until there is widespread implementation of this, a better estimation of the association of country of birth with health outcomes using routinely collected clinical data may help guide efforts in the future around which information needs to be collected and in what manner. Electronic health records (EHRs), an important source for health disparities research 22 , 23 , 24 in Latino populations, 25 , 26 , 27 , 28 likely contain significant populations of US and foreign‐born Latinos, 29 and may report numerous demographics relevant to many Latinos (language, insurance, migrancy). 27 , 30 , 31 However, while EHRs have been used to conduct health disparities research in populations with a high prevalence of foreign‐born Latinos, a lack of explicit country of birth data has been a persistent limitation of EHRs.

This study takes advantage of a unique multi‐state EHR network to address this limitation and to serve as a proof of concept for using EHRs to perform data disaggregation in health equity research. First, it contains extensive demographic information and linked community data that could inform approaches to infer country of birth. Second, for a proportion of Latinos in the network, this EHR contains a discrete data field with country of birth information, allowing validation of approaches that employ multi‐level data to infer country of birth, for the purpose of health equity research. This study aims to address calls for data disaggregation in health equity research by presenting a method using an EHR‐based approach to identify Latino country of birth. Specifically, using EHR data linked to census surname lists, census neighborhood composition, and Forebears (an online resource for surname geographic distribution across several countries), we developed and validated prediction models for foreign‐born status and specific country of birth using EHR data from a study examining asthma care utilization among a large sample of children seen in community health centers (CHCs). We hypothesized that these models would provide suitable predictive performance of country of birth for future health disparities research questions that aim to evaluate the impact of nativity and country of birth on health outcomes and utilization.

2. METHODS

2.1. Data sources

Using a retrospective cross‐sectional study design, we extracted OCHIN EHR data for children with patient‐reported country of origin. OCHIN (not an acronym) is a nonprofit health information technology organization, which provides a single, patient‐linked instance of the Epic® EHR (each patient has a single ID number and medical record shared across every clinic in the network). 32 , 33 OCHIN EHR data come from the Accelerating Data Value Across a National Community Health Center Network (ADVANCE) clinical research network (CRN). Specifically, we extracted data from structured EHR fields for patients across 456 safety net primary care clinics live on the OCHIN network for the years 2012–2018. These clinics were located in 15 states (AK, CA, GA, IN, MA, MN, MT, NC, NM, NV, OH, OR, TX, WA, and WI). OCHIN CHCs follow the Office of Management and Budget standards for race and ethnicity by collecting these two variables separately. We excluded patients with missing ethnicity data (n = 594 patients). Next, we included Latino children with a date of birth after January 1, 1999 and ≥1 ambulatory visit to an OCHIN clinic from 2012 to 2018. We defined children as Latino if they self‐reported Hispanic ethnicity or indicated Spanish as their preferred language, as not everyone of Hispanic ethnicity identifies with that term, and broad definitions of this ethnic group include Spanish as a fundamental defining feature. 34 Race is not part of our algorithm, as Latino patients identify race in varied patterns that do not have clear associations with national origin. 35 Patients were included if their self‐reported country of origin data were documented in the EHR and denoted them as being US‐born or born in Cuba, Guatemala, Mexico, Nicaragua, or Panama, as these were the most prevalent Latino countries of origin observed in our patient population.
At the time of registration, patients have the option to report their country of origin along with other demographic data, such as language, need for interpreter, religion, and marital status. This information is entered and stored in the EHR and the main patient table in the OCHIN Clarity database. Of note, country of origin information is not routinely collected across clinics and this is one of the first studies to utilize this information from EHRs. In our study population of children, country of birth information is collected on 10.9% of Latino pediatric patients. While this is a low proportion of total Latino patients, it includes 19,985 Latino pediatric patients, who represent a large total number of US‐born and foreign‐born patients not usually seen in health services datasets. We also required children to have valid home address data from 2010 to 2018, which was geocoded to allow identification of the census tract to link with neighborhood composition derived from the American Community Survey (ACS). We used the ACS to obtain the percent of the national population of each census tract that classified themselves as being of each respective country of origin during the 2012–2016 measurement period.
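The inclusion rules described above can be summarized as a simple record filter. The following is a minimal sketch under stated assumptions, not the study's extraction code: the field names (`ethnicity`, `language`, `birth_date`, `n_visits`, `country_of_origin`) and their coded values are hypothetical and chosen only for illustration.

```python
from datetime import date

# Countries of origin retained in the study sample.
ELIGIBLE_COUNTRIES = {
    "United States", "Cuba", "Guatemala", "Mexico", "Nicaragua", "Panama",
}


def is_eligible(patient: dict) -> bool:
    """Return True if a (hypothetical) patient record meets the inclusion rules."""
    # Patients with missing ethnicity data were excluded first.
    if patient.get("ethnicity") is None:
        return False
    # Latino: self-reported Hispanic ethnicity OR Spanish as preferred language.
    is_latino = (
        patient["ethnicity"] == "Hispanic" or patient.get("language") == "Spanish"
    )
    # Children: date of birth after January 1, 1999.
    is_child = patient["birth_date"] > date(1999, 1, 1)
    # At least one ambulatory visit during 2012-2018 (count assumed precomputed).
    has_visit = patient.get("n_visits", 0) >= 1
    # Documented country of origin among the countries kept in the study.
    has_country = patient.get("country_of_origin") in ELIGIBLE_COUNTRIES
    return is_latino and is_child and has_visit and has_country
```

Under this sketch, a Spanish‐preferring child who did not report Hispanic ethnicity still qualifies, which mirrors the broad definition of Latino used in the study.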

Finally, we used a two‐step approach to securely use EHR‐derived surname data to estimate ethnic origin by surname geographic distribution using the forebears.io genealogical source. 36 , 37 The Forebears data are derived from voter and jury lists, telephone directories, and public registries such as tax, land, and welfare sources. In the first step, for each country in our study (Mexico, Guatemala, Cuba, Nicaragua, and Panama) we derived the prevalence of each surname on the 2010 US Census Bureau list of frequently occurring surnames 38 (names that were given by ≥100 respondents). In the second step, we attached the prevalence of each name by country to OCHIN patients by linking patient surnames to names on the Census Bureau list. This led to each OCHIN patient having an estimate of the prevalence of their surname in Mexico, Guatemala, Cuba, Nicaragua, and Panama. In this way, no identifying data left the secure EHR database while few patients were excluded for having rare or mistaken names.
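The two‐step linkage can be illustrated with a small sketch. It assumes, following the footnotes to Tables 1 and 2, that surname prevalence is expressed as the percent of all bearers of a surname who are found in each country. The surnames and counts below are invented placeholders, and the function names are hypothetical; this is not the study's code.

```python
# Hypothetical worldwide bearer counts per surname (illustrative numbers only).
surname_counts = {
    "EXAMPLEA": {"Mexico": 600, "USA": 300, "Guatemala": 100},
    "EXAMPLEB": {"Cuba": 50, "USA": 150, "Other": 50},
}

COUNTRIES = ["USA", "Mexico", "Guatemala", "Cuba", "Nicaragua", "Panama"]


def surname_shares(counts: dict) -> dict:
    """Percent of a surname's bearers located in each study country."""
    total = sum(counts.values())
    return {c: 100.0 * counts.get(c, 0) / total for c in COUNTRIES}


# Step 1: build a surname-level lookup table outside the patient data.
lookup = {name: surname_shares(c) for name, c in surname_counts.items()}


def link_patient(surname: str):
    """Step 2: attach surname shares to a patient inside the secure environment.

    Returns None when the surname is not on the list; such patients
    would be excluded for having an unidentified surname.
    """
    return lookup.get(surname.strip().upper())
```

Because only the surname‐level lookup table is brought into the secure environment, no patient identifiers need to leave it, which is the design goal described above.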

2.2. Outcomes

The primary outcomes were binary indicators denoting patient‐reported Latino nativity and subgroup. For nativity, we considered a binary outcome that evaluated whether a patient was US‐born versus foreign‐born, where foreign‐born includes any Latino country seen in our sample. For specific Latin American countries of birth, these included: (1) Mexico versus non‐Mexico, (2) Cuba versus non‐Cuba, and (3) Guatemala versus non‐Guatemala. Two other Latin American countries were included in the country of origin data set but were not considered as discrete outcomes for country‐specific prediction because the small sample sizes were not conducive to proper modeling (Panama, N = 4; Nicaragua, N = 23). Because Puerto Rico is a US territory and not separately categorized in the country of origin data, we were unable to identify those born in Puerto Rico and thus not able to produce a prediction model for this group.

2.3. Inputs and predictors

2.3.1. EHR data

Multiple patient‐level characteristics were extracted from the OCHIN EHR that were hypothesized to predict our outcomes. Specifically, we considered patient age, primary language (English, Spanish, other), self‐reported Hispanic status (yes/no), self‐reported family size, tobacco use, number of ambulatory visits in the study period, proportion of ambulatory visits where the insurance status was uninsured, categorization of insurance utilization over the study period (all clinic visits uninsured, some clinic visits uninsured, any public insurance, any private insurance, missing), patient's US region of residence (south, west, midwest, northeast), an indicator if the patient ever reported being homeless during any encounter in the study period, an indicator if patient ever reported migrant seasonal worker status during any encounter in the study period, and urban/rural status of patient's primary clinic.

2.3.2. Neighborhood data

Using the patient's geocoded address and census tract linked to the ACS, we also considered inputs that described neighborhood‐level ethnic composition. Specifically, at the census tract level, we included the percent of Cubans, Guatemalans, Mexicans, Nicaraguans, and Panamanians depending on the outcome under study, each as their own inputs.

2.3.3. Surname data

From surname data linked to Forebears, we included the prevalence of the patient's surname in each country (US, Cuba, Guatemala, Mexico, Nicaragua, and Panama) as inputs.

2.3.4. Missing data

A total of 24,493 patients met sample criteria prior to removing missing data. Where possible, missing data were incorporated as an ‘unknown’ level for categorical inputs; otherwise, patients with missing data were removed from the analysis. These were patients whose address records could not be geocoded because of missing or mis‐entered address data (n = 4475) and patients with an unidentified surname (n = 33). After applying sample criteria, including removing missing data, a total of 19,985 pediatric Latino patients were included in the study.
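As a quick check of the sample flow, and as a sketch of how a missing categorical value can be kept as its own level, consider the following. The helper name is hypothetical, and the arithmetic assumes that the two exclusion groups do not overlap, which the reported totals are consistent with.

```python
def with_unknown_level(value):
    """Recode a missing categorical value as an explicit 'unknown' level."""
    return "unknown" if value is None or value == "" else value


# Sample flow reported in the text (assumes the exclusion groups are disjoint).
met_criteria = 24_493
not_geocodable = 4_475
unidentified_surname = 33

analytic_sample = met_criteria - not_geocodable - unidentified_surname
```

With the counts above, `analytic_sample` equals 19,985, matching the final study sample.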

2.4. Statistical methods

Our approach and reporting were informed by the TRIPOD statement for prediction models. 39 First, we split our study sample into a training, or model development, data set (80% of the study sample) and a testing, or validation, data set (20% of the study sample) using simple random sampling. Descriptive analyses were conducted to report patient characteristics, overall and by training and validation cohorts.
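A simple random 80/20 split of this kind can be sketched as follows. The function name and the seed are arbitrary choices made for illustration, and the patient IDs are stand‐in integers.

```python
import random


def train_test_split_ids(ids, train_fraction=0.8, seed=2023):
    """Simple random split of patient IDs into training and testing sets."""
    ids = list(ids)
    rng = random.Random(seed)  # fixed (arbitrary) seed so the split is reproducible
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# Stand-in IDs for the 19,985 patients in the analytic sample.
train_ids, test_ids = train_test_split_ids(range(19_985))
```

Applied to 19,985 patients, an 80% training fraction yields 15,988 training and 3997 testing patients, the sizes reported in Table 1.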

Inputs were given equal weight, regardless of the unit of measurement, by centering and scaling them. 40 Each level of the categorical inputs was regarded as a separate binary variable compared to a reference group. Numeric inputs were scaled to have a mean of 0 and a standard deviation of 0.5. The observed distributions in the training dataset were independently applied to the testing dataset.
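The preprocessing described above can be sketched in a few lines. Dividing a centered variable by two standard deviations gives it a standard deviation of 0.5, which puts numeric inputs on a scale comparable to binary indicators. The function names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Learn centering and scaling constants from the training data only."""
    return mean(train_values), stdev(train_values)


def scale(values, center, sd):
    """Center at 0 and divide by two SDs so the training data have SD 0.5."""
    return [(v - center) / (2.0 * sd) for v in values]


def one_hot(value, levels, reference):
    """Indicator variables for each non-reference level of a categorical input."""
    return {lvl: int(value == lvl) for lvl in levels if lvl != reference}


# Constants are estimated on training data and then reused for the testing data.
train_age = [2.0, 4.0, 6.0, 8.0]
test_age = [5.0, 9.0]
center, sd = fit_scaler(train_age)
train_scaled = scale(train_age, center, sd)
test_scaled = scale(test_age, center, sd)
```

Reusing the training constants on the testing data keeps information from the withheld sample out of model development.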

To build our prediction models we leveraged Super Learner, an ensemble method that utilizes cross‐validation to select an optimal combination of prediction across multiple learners (i.e., a diverse set of algorithms); in this way, the training data set serves the purpose of both training and validation. 41 , 42 It does so by creating an optimal weighted average of those algorithms (i.e., an ensemble). To calculate the contribution of each candidate algorithm (i.e., weight) to the Super Learner prediction, we used stratified 10‐fold cross‐validation and the rank loss function to define the optimal combination as the convex combination that maximizes the area under the receiver operating characteristic curve (AUC). Weights were constrained to be non‐negative and sum to one. AUC‐maximizing metalearners can outperform accuracy‐based performance measures for binary classification problems, particularly when the response is imbalanced. 43
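The idea of an AUC‐maximizing convex combination can be illustrated with a toy example. This is a sketch only: it assumes each learner's cross‐validated (out‐of‐fold) predictions are already available, and it replaces the optimizer used by an actual Super Learner implementation with a coarse grid search over weights. All function names are hypothetical.

```python
from itertools import product


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def convex_weights(n_learners, step=0.1):
    """Enumerate non-negative weight vectors on a grid that sum to one."""
    k = round(1 / step)
    for combo in product(range(k + 1), repeat=n_learners):
        if sum(combo) == k:
            yield [c / k for c in combo]


def best_ensemble(labels, cv_preds, step=0.1):
    """Pick the convex combination of learners' CV predictions with the highest AUC."""
    best_w, best_auc = None, -1.0
    for w in convex_weights(len(cv_preds), step):
        blended = [
            sum(wj * preds[i] for wj, preds in zip(w, cv_preds))
            for i in range(len(labels))
        ]
        a = auc(labels, blended)
        if a > best_auc:
            best_w, best_auc = w, a
    return best_w, best_auc


# Toy example: two weak learners whose blend separates the classes perfectly.
labels = [0, 0, 1, 1]
learner_1 = [0.1, 0.6, 0.5, 0.9]
learner_2 = [0.4, 0.2, 0.3, 0.8]
weights, ensemble_auc = best_ensemble(labels, [learner_1, learner_2])
```

In this toy case each learner alone has an AUC of 0.75, while a suitable blend reaches 1.0, which shows why weakly informative but diverse learners can still improve the ensemble.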

The number and diversity of candidate algorithms, even when they are weakly informative, improve the performance of the ensemble learner. 41 , 42 In our study we considered nine general approaches: logistic regression, stepwise selection, regularization, Bayesian GLM, smoothing splines, random forests, tree‐based boosting, k‐nearest neighbors, and support vector machines. Many of these models required further specification of hyperparameters, and the full list of algorithms included is given in Table S1. In some instances, optimal hyperparameters were chosen with nested cross‐validation within the implementation of the algorithm.

We considered the three input sources—EHR, neighborhood, and surname data—separately so that they could contribute variably to different algorithms. This allowed us to examine the respective model improvement associated with each additional, novel data source. In total, there were 54 learners evaluated in the ensemble for each of the 4 outcomes.

We evaluated both the performance of our Super Learner and its predictions. In the training data set, we examined the relative performance of the Super Learner and each learner in the ensemble with cross‐validated AUC. The primary prediction performance metric was the AUC derived from the testing data set as the AUC summarizes predictive accuracy under the full range of predictive probability cutoff values, and therefore does not require selecting a single probability cutoff to classify patients into groups. To judge our ability to infer Latino nativity for new patients we calculated the AUC among the testing data set using the models and their weighted contributions to the ensemble learner established in the training set. Additionally, for the final model of each of the four outcomes, we estimated classification accuracy measures including sensitivity, specificity, accuracy, and F1 scores at various representative cutpoints in predicted probabilities. The F1 score is a measure that combines the precision and recall of a classifier through its harmonic mean. The primary purpose of the F1 score is to compare the performance of two classifiers. F1 scores range from 0 to 1 with higher scores denoting fewer false positives and fewer false negatives.
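The classification measures named above follow directly from the confusion matrix at a chosen cutpoint. The sketch below is illustrative: the function name is hypothetical, and treating a predicted probability equal to the cutpoint as positive is an assumption.

```python
def classification_metrics(labels, probs, cutpoint):
    """Sensitivity, specificity, accuracy, and F1 at a probability cutpoint."""
    preds = [int(p >= cutpoint) for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(labels)
    # F1 is the harmonic mean of precision and recall.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Toy example with five patients and a cutpoint of 0.5.
metrics = classification_metrics([1, 1, 0, 0, 0], [0.9, 0.4, 0.6, 0.2, 0.1], 0.5)
```

Unlike the AUC, each of these measures depends on the cutpoint, which is why they are reported at several representative cutpoints rather than as a single value.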

All analyses were completed using R version 3.4.0. The study was approved by our Institutional Review Board.

3. RESULTS

3.1. Participants

A total of 19,985 pediatric Latino patients met the study criteria (a comparison of patients who met the study criteria with those who did not because of missing ethnicity data is included in Table S2). Characteristics of patients in the full sample, as well as in the training and validation datasets, are reported in Table 1. The pediatric patient sample had a mean age of 9.3 years and an average of 10 visits during the study period, 59% reported Spanish as their preferred language, and 88% were publicly insured. About 60% of our patients were located in the western US and about 25% were from the northeast. Overall, the distribution of patient characteristics was similar between the training and validation datasets.

TABLE 1.

Characteristics of pediatric patients in the full, training, and validation samples.

Full dataset Training dataset Validation dataset
All patients meeting inclusion criteria Patients used for model development Patients reserved for final performance statistics
N (%) 19,985 (100.0) 15,988 (80.0) 3997 (20.0)
Country of origin, N (%)
United States 17,844 (89.3) 14,258 (89.2) 3586 (89.7)
Mexico 1012 (5.1) 815 (5.1) 197 (4.9)
Guatemala 949 (4.7) 769 (4.8) 180 (4.5)
Cuba 153 (0.8) 123 (0.8) 30 (0.8)
Nicaragua 23 (0.1) 21 (0.1) 2 (0.1)
Panama 4 (0.0) 2 (0.0) 2 (0.1)
Self‐reported Latino a
Yes 19,718 (98.7) 15,774 (98.7) 3944 (98.7)
No 267 (1.3) 214 (1.3) 53 (1.3)
Age at last encounter, mean (SD) 9.3 (5.2) 9.3 (5.2) 9.3 (5.3)
Primary language, N (%)
English 7994 (40.0) 6367 (39.8) 1627 (40.7)
Spanish 11,878 (59.4) 9526 (59.6) 2352 (58.8)
Other 113 (0.6) 95 (0.6) 18 (0.5)
Family size b , N (%)
0–1 4126 (20.6) 3300 (20.6) 826 (20.7)
2–3 4149 (20.8) 3353 (21.0) 796 (19.9)
4–5 5428 (27.2) 4332 (27.1) 1096 (27.4)
6 or more 1527 (7.6) 1217 (7.6) 310 (7.8)
Unknown 4755 (23.8) 3786 (23.7) 969 (24.2)
Tobacco use, N (%)
Yes 668 (3.3) 520 (3.3) 148 (3.7)
No 19,317 (96.7) 15,468 (96.7) 3849 (96.3)
Number of visits, mean (SD) 10.1 (13.1) 10.1 (13.3) 10.0 (12.4)
Percent of uninsured visits, mean (SD) 6.2 (18.6) 6.2 (18.7) 6.1 (18.3)
Insurance status, N (%)
All uninsured 521 (2.6) 424 (2.7) 97 (2.4)
Any public 17,657 (88.4) 14,137 (88.4) 3520 (88.1)
Any private 564 (2.8) 448 (2.8) 116 (2.9)
Mix 1243 (6.2) 979 (6.1) 264 (6.6)
Geographic region, N (%)
South 1964 (9.8) 1601 (10.0) 363 (9.1)
West 11,853 (59.3) 9486 (59.3) 2367 (59.2)
Midwest 1499 (7.5) 1191 (7.4) 308 (7.7)
Northeast 4669 (23.4) 3710 (23.2) 959 (24.0)
Ever experienced homelessness, N (%)
Yes 593 (3.0) 480 (3.0) 113 (2.8)
No 19,392 (97.0) 15,508 (97.0) 3884 (97.2)
Ever a migrant seasonal worker, N (%)
Yes 257 (1.3) 209 (1.3) 48 (1.2)
No 19,728 (98.7) 15,779 (98.7) 3949 (98.8)
Clinic location
Urban 17,913 (89.6) 14,317 (89.5) 3596 (90.0)
Rural/unknown 2072 (10.4) 1671 (10.5) 401 (10.0)
Percent Cubans in census tract, mean (SD) 0.2 (0.5) 0.2 (0.5) 0.2 (0.5)
Percent Guatemalans in census tract, mean (SD) 4.4 (7.6) 4.4 (7.6) 4.5 (7.6)
Percent Mexicans in census tract, mean (SD) 35.8 (28.9) 36.0 (28.9) 35.1 (28.7)
Percent Nicaraguans in census tract, mean (SD) 0.3 (0.7) 0.3 (0.7) 0.3 (0.7)
Percent Panamanians in census tract, mean (SD) 0.0 (0.2) 0.0 (0.2) 0.0 (0.2)
Percent of surname in USA c , mean (SD) 16.9 (21.5) 16.8 (21.4) 17.1 (21.9)
Percent of surname in Cuba c , mean (SD) 2.0 (3.2) 2.0 (3.2) 1.9 (3.0)
Percent of surname in Guatemala c , mean (SD) 4.7 (12.3) 4.7 (12.6) 4.4 (11.1)
Percent of surname in Mexico c , mean (SD) 27.6 (19.0) 27.6 (19.0) 27.9 (19.0)
Percent of surname in Nicaragua c , mean (SD) 1.2 (2.8) 1.2 (2.9) 1.2 (2.1)
Percent of surname in Panama c , mean (SD) 0.6 (1.2) 0.6 (1.3) 0.6 (1.1)
a

Patients were identified from the EHR as Latino if they reported Hispanic ethnicity or Spanish as primary language; there were a number of patients who spoke Spanish primarily, but who did not self‐report as Hispanic when asked about ethnicity.

b

Number of persons in family/household supported by the household income.

c

For each surname, we estimated the following conditional likelihood (expressed as a percent): of the world's population with that specific surname, what percentage is found in every specific country. For example, among all people with the Hernandez surname, what percent are located in Mexico, USA, Cuba, and so forth.

Of the 19,985 Latino patients, 10.7% reported a non‐US country of birth (5.1% Mexican, 4.7% Guatemalan, and 0.8% Cuban) while the rest reported being US‐born (89.3%). When comparing US‐born to non‐US‐born patients (Table 2), we found that US‐born Latinos were more frequently English preferring than non‐US‐born Latinos. US‐born Latinos were also less likely to have uninsured visits and the majority of their visits were covered with public insurance (91.2%). They also reported smaller family sizes. Of note, US‐born Latinos lived in neighborhoods with a higher percent of Mexicans (37.1%) compared to non‐US‐born Latinos (25.4%). Comparisons of patient characteristics between Mexican‐born versus non‐Mexican‐born (Table S3), Guatemalan‐born versus non‐Guatemalan‐born (Table S4), and Cuban‐born versus non‐Cuban‐born (Table S5) patients are reported in supplementary materials.

TABLE 2.

Characteristics of pediatric patients who are US and non‐US‐born among the full study sample.

Overall US born Non‐US born SMD a
All patients meeting inclusion criteria Patients reported being born in the US Patients reported being born in Cuba, Guatemala, Mexico, Nicaragua, or Panama
N (%) 19,985 (100.0) 17,844 (89.3) 2141 (10.7)
Country of origin, N (%) N/A
United States 17,844 (89.3) 17,844 (100.0) 0 (0.0)
Mexico 1012 (5.1) 0 (0.0) 1012 (47.3)
Guatemala 949 (4.7) 0 (0.0) 949 (44.3)
Cuba 153 (0.8) 0 (0.0) 153 (7.1)
Nicaragua 23 (0.1) 0 (0.0) 23 (1.1)
Panama 4 (0.0) 0 (0.0) 4 (0.2)
Self‐reported Latino b 0.078
Yes 19,718 (98.7) 17,625 (98.8) 2093 (97.8)
No 267 (1.3) 219 (1.2) 48 (2.2)
Age at last encounter, mean (SD) 9.3 (5.2) 8.9 (5.2) 12.1 (4.9) 0.625
Primary language, N (%) 0.884
English 7994 (40.0) 7815 (43.8) 179 (8.4)
Spanish 11,878 (59.4) 9927 (55.6) 1951 (91.1)
Other 113 (0.6) 102 (0.6) 11 (0.5)
Family size c , N (%) 0.698
0–1 4126 (20.6) 3653 (20.5) 473 (22.1)
2–3 4149 (20.8) 3634 (20.4) 515 (24.1)
4–5 5428 (27.2) 4612 (25.8) 816 (38.1)
6 or more 1527 (7.6) 1263 (7.1) 264 (12.3)
Unknown 4755 (23.8) 4682 (26.2) 73 (3.4)
Tobacco use, N (%) 0.013
Yes 668 (3.3) 601 (3.4) 67 (3.1)
No 19,317 (96.7) 17,243 (96.6) 2074 (96.9)
Number of visits, mean (SD) 10.1 (13.1) 10.3 (13.3) 8.6 (11.3) 0.134
Percent of uninsured visits, mean (SD) 6.2 (18.6) 4.8 (15.7) 17.4 (32.5) 0.492
Insurance status, N (%) 0.691
All uninsured 521 (2.6) 295 (1.7) 226 (10.6)
Any public 17,657 (88.4) 16,281 (91.2) 1376 (64.3)
Any private 564 (2.8) 374 (2.1) 190 (8.9)
Mix 1243 (6.2) 894 (5.0) 349 (16.3)
Geographic region, N (%) 0.217
South 1964 (9.8) 1770 (9.9) 194 (9.1)
West 11,853 (59.3) 10,762 (60.3) 1091 (51.0)
Midwest 1499 (7.5) 1295 (7.3) 204 (9.5)
Northeast 4669 (23.4) 4017 (22.5) 652 (30.4)
Ever experienced homelessness, N (%) 0.182
Yes 593 (3.0) 458 (2.6) 135 (6.3)
No 19,392 (97.0) 17,386 (97.4) 2006 (93.7)
Ever a migrant seasonal worker, N (%) 0.150
Yes 257 (1.3) 188 (1.1) 69 (3.2)
No 19,728 (98.7) 17,656 (98.9) 2072 (96.8)
Clinic location 0.019
Urban 17,913 (89.6) 15,983 (89.6) 1930 (90.1)
Rural/unknown 2072 (10.4) 1861 (10.4) 211 (9.9)
Percent Cubans in census tract, mean (SD) 0.2 (0.5) 0.2 (0.5) 0.3 (0.6) 0.117
Percent Guatemalans in census tract, mean (SD) 4.4 (7.6) 4.2 (7.4) 5.8 (8.5) 0.193
Percent Mexicans in census tract, mean (SD) 35.8 (28.9) 37.1 (28.9) 25.4 (26.7) 0.421
Percent Nicaraguans in census tract, mean (SD) 0.3 (0.7) 0.3 (0.7) 0.2 (0.6) 0.089
Percent Panamanians in census tract, mean (SD) 0.0 (0.2) 0.0 (0.2) 0.0 (0.2) 0.090
Percent of surname in USA d , mean (SD) 16.9 (21.5) 16.7 (21.3) 17.9 (23.1) 0.053
Percent of surname in Cuba d , mean (SD) 2.0 (3.2) 1.9 (2.6) 2.6 (6.4) 0.155
Percent of surname in Guatemala d , mean (SD) 4.7 (12.3) 4.1 (11.1) 9.1 (19.3) 0.315
Percent of surname in Mexico d , mean (SD) 27.6 (19.0) 27.4 (19.1) 29.2 (18.3) 0.096
Percent of surname in Nicaragua d , mean (SD) 1.2 (2.8) 1.2 (2.6) 1.5 (4.1) 0.083
Percent of surname in Panama d , mean (SD) 0.6 (1.2) 0.6 (1.3) 0.6 (0.9) 0.008

Note: Other outcomes are described in Table S3 through Table S5.

a

SMD, Standardized mean difference between US‐born patients and non‐US‐born patients.

b

Patients were identified from the EHR as Latino if they reported Hispanic ethnicity or Spanish as primary language; there were a number of patients who spoke Spanish primarily, but who did not self‐report as Hispanic when asked about ethnicity.

c

Number of persons in family/household supported by the household income.

d

For each surname, we estimated the following conditional likelihood (expressed as a percent): of the world's population with that specific surname, what percentage is found in every specific country. For example, among all people with the Hernandez surname, what percent are located in Mexico, USA, Cuba, and so forth.
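For readers unfamiliar with the standardized mean difference (SMD) column in Table 2, the following sketch shows one common definition for continuous variables. The exact formula used in the study is not stated, so the choice of denominator (the square root of the average of the two group variances) is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def smd_continuous(mean_a, sd_a, mean_b, sd_b):
    """Absolute standardized mean difference using the average of the two variances."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


# Age at last encounter, using the rounded summary values shown in Table 2.
age_smd = smd_continuous(8.9, 5.2, 12.1, 4.9)
```

Computed from the rounded means and standard deviations, this gives a value close to, but not identical with, the 0.625 reported for age, as expected when working from rounded summaries.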

3.2. Model predictive performance

Using the training data set (N = 15,988), cross‐validated AUCs for the 55 prediction algorithms (54 learners plus the Super Learner) ranged from 0.743 to 0.884 for predicting US‐born versus foreign‐born, 0.678–0.885 for predicting Mexican versus non‐Mexican, 0.834–0.947 for Guatemalan versus non‐Guatemalan, and 0.752–0.984 for Cuban versus non‐Cuban. Figure 1 displays the Super Learner and single best and worst performing algorithms for each outcome. The ensemble method, Super Learner, demonstrated the best performance across all modeled responses. The best single base learner was a random forest that included all patient‐level EHR covariates, neighborhood‐level data, and patient surname data. While some algorithms perform well for some outcomes, the same algorithms do not perform comparably for other outcomes. For example, support vector machines ranked in the top half of algorithms when modeling Guatemalan or Cuban birth but were among the worst at predicting US‐born or Mexican birth. Table S6 reports the performance and rank of the ensemble learner and all 54 individual prediction algorithms considered in this study. The coefficients for the weighted ensemble Super Learner are presented in Table S7. This table shows that the Super Learner utilizes several prediction algorithms for each outcome and that for all outcomes the highest weighted algorithm was the random forest model that included all the available inputs.

FIGURE 1.

Comparison of cross‐validated area under the curve (AUC) by the two best and two worst performing prediction algorithms for Non‐US‐born, Mexican, Guatemalan, and Cuban models. Latino non‐US‐born only includes the following countries: Cuba, Guatemala, Mexico, Nicaragua, Panama. Specific prediction models for Panama and Nicaragua were not produced as the sample sizes were not conducive to proper modeling (Panama, N = 4; Nicaragua, N = 23). Algorithm labels specify the Super Learner wrapper, hyperparameter, and subset of inputs that give the cross‐validated AUC estimate. The EHR suffix refers to predictors that were derived only from patients' electronic health records. Algorithms with a suffix of Name used EHR, neighborhood‐level data, and patient surname data in prediction. The performance of all prediction algorithms is reported in Figure S1 and Table S6. AUC, area under the curve; CI, confidence interval.

In our withheld validation data set (N = 3997), prediction using the Super Learner for US‐born versus non‐US‐born (Figure 2 and Table 3) showed outstanding performance (AUC = 0.898, 95% confidence interval [CI] = 0.883–0.913). We observed similar performance for the outcomes of Mexican versus non‐Mexican (AUC = 0.891, 95% CI = 0.871–0.912), Guatemalan versus non‐Guatemalan (AUC = 0.952, 95% CI = 0.939–0.966) and Cuban versus non‐Cuban (AUC = 0.996, 95% CI = 0.992–1.000).
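To make the reported quantities concrete, the following sketch computes an AUC on a held‐out sample and attaches a percentile bootstrap confidence interval. The synthetic scores, the class sizes, the number of resamples, and the bootstrap itself are assumptions for illustration; this section does not state how the study's confidence intervals were obtained.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return total / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed for the AUC to be defined
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic validation sample (not study data): 30 positives, 170 negatives.
data_rng = random.Random(1)
val_labels = [1] * 30 + [0] * 170
val_scores = [data_rng.gauss(1.5, 1.0) for _ in range(30)] + \
             [data_rng.gauss(0.0, 1.0) for _ in range(170)]

point = auc(val_labels, val_scores)
ci_low, ci_high = bootstrap_auc_ci(val_labels, val_scores)
print(f"AUC = {point:.3f}, 95% CI = {ci_low:.3f} to {ci_high:.3f}")
```

With few positive cases the interval tends to widen, which is worth keeping in mind when reading the estimates for the smaller country‐of‐birth subgroups.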

FIGURE 2.

Receiver operating characteristics curves for Latino nativity prediction models using the Super Learner in the withheld validation data set. [Color figure can be viewed at wileyonlinelibrary.com]

TABLE 3.

Area under the curve for different Latino subgroups in the withheld validation data set.

| Modeled response | AUC (validation data set) | 95% CI (validation data set) |
| --- | --- | --- |
| US‐born versus non‐US‐born | 0.898 | 0.883–0.913 |
| Mexican‐born versus non‐Mexican‐born | 0.891 | 0.871–0.912 |
| Guatemalan‐born versus non‐Guatemalan‐born | 0.952 | 0.939–0.966 |
| Cuban‐born versus non‐Cuban‐born | 0.996 | 0.992–1.000 |

Note: Latino non‐US‐born only includes the following countries: Cuba, Guatemala, Mexico, Nicaragua, Panama. Specific prediction models for Panama and Nicaragua were not produced as the sample sizes were not conducive to proper modeling (Panama, N = 4; Nicaragua, N = 23).

Abbreviations: AUC, area under the curve; CI, confidence interval.

Tables S8–S11 show additional model performance metrics, including sensitivity, specificity, accuracy, and F1 scores, at various representative cutpoints in predicted probabilities. For all outcomes, the higher the cutpoint in predicted probability, the higher the classification accuracy. The highest F1 score was obtained at a different percentile cutoff of predicted probability for each outcome (US‐born: percentile cutoff of 90%, F1 = 0.550; Mexican‐born: percentile cutoff of 94%, F1 = 0.407; Guatemalan‐born: percentile cutoff of 97%, F1 = 0.567; Cuban‐born: percentile cutoff of 99%, F1 = 0.657).
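The cutpoint‐based metrics above can be sketched as follows. The function name, the nearest‐rank percentile rule, the "at or above the cutoff" classification rule, and the ten toy patients are assumptions for illustration and are not taken from the supplementary tables.

```python
def metrics_at_percentile(labels, probs, pct):
    """Classify as positive when the predicted probability is at or above the
    pct-th percentile (nearest-rank rule) of all predicted probabilities."""
    ordered = sorted(probs)
    n = len(ordered)
    cutoff = ordered[min(n - 1, int((pct * n) // 100))]
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= cutoff)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= cutoff)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < cutoff)
    return {
        "cutoff": cutoff,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / n,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
    }


# Toy predicted probabilities and self-reported labels (not study data).
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

for pct in (50, 80):
    m = metrics_at_percentile(labels, probs, pct)
    print(pct, {k: round(v, 3) for k, v in m.items()})
```

In this toy sample, moving the cutoff from the 50th to the 80th percentile raises accuracy while lowering sensitivity, the kind of trade‐off that makes F1 peak at a different cutoff for each outcome.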

4. DISCUSSION

This study aimed to address calls for data disaggregation in health equity research by presenting a case study in pediatric asthma that used multilevel EHR, surname, and community data to analyze Latino country of birth for the purposes of studying healthcare disparities. Using predictive modeling with a gold standard of EHR self‐reported country of birth among Latino pediatric patients across 15 states, we were able to infer, with good accuracy, foreign‐born status and specific country of origin (limited to Mexico, Guatemala, and Cuba). All models showed outstanding discrimination, both for foreign‐born status and for specific country of origin. This study is a novel contribution to a growing body of literature on imputing missing broad race and ethnicity categories in administrative records, suggesting a new approach to address the more specific problem of missing country of birth in some contexts. 44, 45, 46, 47, 48 Although prediction algorithms are imperfect and cannot replace self‐reported nativity and country of origin, our modeling may be useful for health services, primary care, and public health researchers to better understand how Latino subgroups utilize health care resources and experience health outcomes in the absence of patient‐reported data. Moreover, it may allow practitioners and researchers to begin to understand how more widespread collection of these data from patients may be most useful in the future.

It is notable that the highest‐performing algorithm in the Super Learner ensemble was the algorithm that incorporated the broadest amount of information, including patient‐level EHR data in combination with surname and community‐level subgroup information. This again suggests that single proxies or models with few data points are less helpful than approaches that utilize multilevel data to understand the healthcare utilization of Latino subpopulations.

There are two major next steps in this line of research. First, this algorithmic approach needs to be tested in adults and in larger populations of Latino patients with more countries of birth than the ones that were examined here. Second, this approach should be applied to a larger sample of Latino patients in EHR networks for which no country of origin data is present, to start answering health equity‐related research questions in specific clinical domains (e.g., childhood asthma).

While this approach to inferring country of birth in Latino patients for health equity research bears promise, especially for understanding in which clinical circumstances country of birth may be important to collect, we recognize the importance of emphasizing what these findings do not convey and the purposes for which they are not suitable. At present, this approach may not be suitable for clinical or healthcare use, nor for use in policy decisions. Our modeling is designed to incrementally refine quantitative methods for understanding the use of country of birth in health equity research. At this moment, it is not accurate enough (or intended) for use at the patient level to help provide care, nor would it be ethical to assign ethnic or nativity designations to an individual apart from their own self‐identification. 49 Indeed, our goal is not to classify patients into discrete subgroups on the basis of whether the predicted probability exceeds a threshold. Instead, we encourage researchers to model the estimated predicted probabilities as independent variables in subsequent health disparities‐related research questions. This approach avoids assigning nativity designations and has been shown to reduce bias compared to methods that categorize patients based on a threshold. 45, 50 Accordingly, we plan to evaluate healthcare outcome disparities by Latino subgroups by working directly with the estimated probabilities without producing individual classifications.
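The difference between modeling predicted probabilities directly and classifying patients at a threshold can be illustrated with a small simulation. Everything in this sketch is an assumption: the data are synthetic, the predicted probabilities are taken to be perfectly calibrated, the outcome is a simple continuous measure with a known subgroup difference, and the helper names are hypothetical. It is not the authors' planned analysis.

```python
import random


def simulate(n=20000, true_diff=2.0, seed=42):
    """Synthetic patients: a predicted probability, a hidden true group, an outcome."""
    rng = random.Random(seed)
    probs, outcomes = [], []
    for _ in range(n):
        p = rng.random()                      # predicted probability (assumed calibrated)
        group = 1 if rng.random() < p else 0  # true nativity indicator, never observed
        probs.append(p)
        outcomes.append(true_diff * group + rng.gauss(0.0, 1.0))
    return probs, outcomes


def ols_slope(x, y):
    """Slope from a simple linear regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def threshold_difference(probs, outcomes, cutoff=0.5):
    """Difference in mean outcome after classifying patients at a probability cutoff."""
    high = [y for p, y in zip(probs, outcomes) if p > cutoff]
    low = [y for p, y in zip(probs, outcomes) if p <= cutoff]
    return sum(high) / len(high) - sum(low) / len(low)


probs, outcomes = simulate()
slope = ols_slope(probs, outcomes)
diff = threshold_difference(probs, outcomes)
print(f"true difference = 2.0, probability as covariate = {slope:.2f}, "
      f"threshold classification = {diff:.2f}")
```

Under these assumptions the regression on the probability recovers a value close to the true subgroup difference, while the thresholded comparison is pulled toward zero because each classified group mixes members of both true groups.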

With regard to policy, it would be premature to make policy or resource decisions based on these findings. As stated above, these findings need to be replicated in broader Latino populations, and they are intended to help understand, in the absence of widespread self‐report data linked to healthcare data, in what circumstances and for what health conditions country of birth data might be beneficial to collect in the future. We believe that, at present, this work helps advance a methodological mandate for disaggregated data, 5, 6 rather than informing a policy discussion.

While there have been numerous calls for and approaches to data disaggregation, recent authors 49 have expertly detailed the ongoing ethical considerations surrounding the disaggregation of minority group data. Specifically, Brown et al. 49 at the Urban Institute describe these challenges and risks in detail (violations of confidentiality, inaccurate estimates and their effects, excluding communities from ownership of their data, and generating data for harmful purposes), and we concur with their discussion. The study findings here need to be evaluated in the light of those persistent ethical questions. Others have pointed out challenges in data disaggregation in non‐Latino or broader populations; 51 our analyses specifically focus on Latino populations, for whom there has been a call to investigate disaggregation more fully.

Additionally, some limitations of this study underscore the need for continued research before generalizing our results to new contexts. First, though this study utilized a withheld validation data set to estimate the model's predictive performance, it was limited to CHCs, and future studies should externally validate this approach in different clinical settings and in an adult population. The study was also limited to 15 states, and because we did not have patient representation from all 50 states, we were limited to including regions instead of states as prediction model inputs. Future studies should incorporate representation from across all 50 states. Second, we were limited in the number of countries for which we could build predictive models (Mexico, Guatemala, and Cuba), and future work could expand this to additional countries. Of note, for Puerto Rican patients, our discrete country of birth data did not allow us to distinguish between mainland US‐born and non‐mainland US‐born; this barrier needs to be overcome in future research. Third, some patients did not have adequate address data for geocoding and thus were excluded from the analysis, which could affect study findings if those with address data systematically differed from those without.

Lastly, in our network, although country of birth is collected in other populations that are not Latino, this specific prediction model was not developed to be utilized in non‐Latino populations as only Latino countries were specifically used in the analyses. Additionally, other race and ethnicity groups may have different population characteristics (size, distribution, language) and history in the United States than Latino populations and thus different prediction model inputs may be needed. However, generally, the prediction approach taken in this study could be generalized to other groups with a different set of inputs as specified by investigators with specific expertise in those groups.

Recent interest in more robust health equity research has called attention to the importance of data disaggregation. Improving how we collect, investigate, and utilize ethnicity‐disaggregated data is central to the pursuit of health equity, but there are numerous challenges to doing this safely, ethically, and effectively. In a multistate network of CHCs with multilevel inputs from EHR data linked to surname and community data, we developed and validated novel prediction models for the use of available EHR data to infer Latino nativity for health disparities research in primary care and health services research, which is a significant potential methodologic advance in studying this population. Further work can utilize this approach in larger settings, adults, and in the context of other clinical scenarios.

CONFLICT OF INTEREST STATEMENT

None declared.

Supporting information

Appendix S1. Supplementary tables.

ACKNOWLEDGMENTS

This work was conducted with the Accelerating Data Value Across a National Community Health Center Network (ADVANCE) Clinical Research Network (CRN). ADVANCE is a CRN in PCORnet®, the National Patient Centered Outcomes Research Network. ADVANCE is led by OCHIN in partnership with Health Choice Network, Fenway Health, Oregon Health & Science University, and the Robert Graham Center HealthLandscape. ADVANCE's participation in PCORnet® is funded through the Patient‐Centered Outcomes Research Institute (PCORI), contract number RI‐OCHIN‐01‐MC.

Marino M, Fankhauser K, Minnier J, et al. Disaggregating Latino nativity in equity research using electronic health records. Health Serv Res. 2023;58(5):1119‐1130. doi: 10.1111/1475-6773.14154

REFERENCES

  • 1. Heintzman J, Marino M. The importance of primary care research in understanding health inequities in the United States. J Am Board Fam Med. 2021;34(4):849‐852. doi: 10.3122/jabfm.2021.04.210060
  • 2. Lett E, Asabor E, Beltrán S, Michelle Cannon A, Arah OA. Conceptualizing, contextualizing, and operationalizing race in quantitative health sciences research. Ann Fam Med. 2022;20:157‐163. doi: 10.1370/afm.2792
  • 3. Kauh TJ, Read JG, Scheitler AJ. The critical role of racial/ethnic data disaggregation for health equity. Popul Res Policy Rev. 2021;40:1‐7. doi: 10.1007/s11113-020-09631-6
  • 4. Office of Management and Budget. Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity. 1997.
  • 5. Alcántara C, Suglia SF, Ibarra IP, et al. Disaggregation of Latina/o child and adult health data: a systematic review of public health surveillance surveys in the United States. Popul Res Policy Rev. 2021;40(1):61‐79.
  • 6. Rubin V, Ngo D, Ross A, Butler D, Balaram N. Counting a diverse nation: disaggregating data on race and ethnicity to advance a culture of health. Policy Link. 2018;74. https://www.policylink.org/sites/default/files/Counting_a_Diverse_Nation_08_15_18.pdf
  • 7. Office of the Assistant Secretary for Planning and Evaluation. HHS Implementation Guidance on Data Collection Standards for Race, Ethnicity, Sex, Primary Language, and Disability Status. Accessed June 12, 2022. https://aspe.hhs.gov/reports/hhs‐implementation‐guidance‐data‐collection‐standards‐race‐ethnicity‐sex‐primary‐language‐disability
  • 8. Camacho‐Rivera M, Kawachi I, Bennett GG, Subramanian SV. Revisiting the Hispanic health paradox: the relative contributions of nativity, country of origin, and race/ethnicity to childhood asthma. J Immigr Minor Health. 2015;17(3):826‐833. doi: 10.1007/s10903-013-9974-6
  • 9. Costas‐Muñiz R, Jandorf L, Philip E, et al. Examining the impact of Latino nativity, migration, and acculturation factors on colonoscopy screening. J Community Health. 2016;41(5):903‐909. doi: 10.1007/s10900-016-0168-8
  • 10. Afable‐Munsuz A, Liang SY, Ponce NA, Walsh JM. Acculturation and colorectal cancer screening among older Latino adults: differential associations by national origin. J Gen Intern Med. 2009;24(8):963‐970. doi: 10.1007/s11606-009-1022-9
  • 11. Do DP, Frank R. The diverging impacts of segregation on obesity risk by nativity and neighborhood poverty among Hispanic Americans. J Racial Ethn Health Disparities. 2020;7(6):1214‐1224. doi: 10.1007/s40615-020-00746-2
  • 12. Pew Research Center. Key facts about U.S. Latinos for National Hispanic Heritage Month. Accessed March 14, 2022.
  • 13. Avilés‐Santa ML, Heintzman J, Lindberg NM, et al. Personalized medicine and Hispanic health: improving health outcomes and reducing health disparities–a National Heart, Lung, and Blood Institute workshop report. BMC Proc. 2017;11(Suppl 11):11. doi: 10.1186/s12919-017-0079-4
  • 14. Eldeirawi K, McConnell R, Freels S, Persky VW. Associations of place of birth with asthma and wheezing in Mexican American children. J Allergy Clin Immunol. 2005;116(1):42‐48. doi: 10.1016/j.jaci.2005.03.041
  • 15. Eldeirawi KM, Persky VW. Associations of acculturation and country of birth with asthma and wheezing in Mexican American youths. J Asthma. 2006;43(4):279‐286.
  • 16. Holguin F, Mannino DM, Antó J, et al. Country of birth as a risk factor for asthma among Mexican Americans. Am J Respir Crit Care Med. 2005;171(2):103‐108. doi: 10.1164/rccm.200402-143OC
  • 17. Iqbal S, Oraka E, Chew GL, Flanders WD. Association between birthplace and current asthma: the role of environment and acculturation. Am J Public Health. 2014;104(Suppl 1):S175‐S182. doi: 10.2105/ajph.2013.301509
  • 18. Lara M, Akinbami L, Flores G, Morgenstern H. Heterogeneity of childhood asthma among Hispanic children: Puerto Rican children bear a disproportionate burden. Pediatrics. 2006;117(1):43‐53. doi: 10.1542/peds.2004-1714
  • 19. Subramanian S, Jun H‐J, Kawachi I, Wright RJ. Contribution of race/ethnicity and country of origin to variations in lifetime reported asthma: evidence for a nativity advantage. Am J Public Health. 2009;99(4):690‐697.
  • 20. Akinbami LJ, Moorman JE, Bailey C, et al. Trends in asthma prevalence, health care use, and mortality in the United States, 2001‐2010. NCHS Data Brief. 2012;94:1‐8.
  • 21. Rose D, Mannino DM, Leaderer BP. Asthma prevalence among US adults, 1998‐2000: role of Puerto Rican ethnicity and behavioral and geographic factors. Am J Public Health. 2006;96(5):880‐888. doi: 10.2105/AJPH.2004.050039
  • 22. Kaufmann J, Marino M, Lucas J, et al. Racial and ethnic disparities in acute care use for pediatric asthma. Ann Fam Med. 2022;20(2):116‐122. doi: 10.1370/afm.2771
  • 23. Heintzman J, Kaufmann J, Bailey S, et al. Asthma ambulatory care quality in foreign‐born Latino children in the United States. Acad Pediatr. 2021;22:647‐656.
  • 24. Heintzman JD, Ezekiel‐Herrera DN, Quiñones AR, et al. Disparities in colorectal cancer screening in Latinos and non‐Hispanic whites. Am J Prev Med. 2022;62(2):203‐210.
  • 25. Heintzman J, Kaufmann J, Ezekiel‐Herrera D, et al. Asthma/COPD disparities in diagnosis and basic care utilization among low‐income primary care patients. J Immigr Minor Health. 2019;21(3):659‐663. doi: 10.1007/s10903-018-0798-2
  • 26. Heintzman JD, Bailey SR, Muench J, Killerby M, Cowburn S, Marino M. Lack of lipid screening disparities in obese Latino adults at health centers. Am J Prev Med. 2017;52(6):805‐809. doi: 10.1016/j.amepre.2016.12.020
  • 27. Heintzman J, Bailey SR, DeVoe J, et al. In Low‐Income Latino Patients, Post‐Affordable Care Act Insurance Disparities May Be Reduced Even More than Broader National Estimates: Evidence from Oregon. J Racial Ethn Health Disparities. 2017;4(3):329‐336. doi: 10.1007/s40615-016-0232-1
  • 28. Heintzman J, Bailey SR, Cowburn S, Dexter E, Carroll J, Marino M. Pneumococcal vaccination in low‐income Latinos: an unexpected trend in oregon community health centers. J Health Care Poor Underserved. 2016;27(4):1733‐1744. doi: 10.1353/hpu.2016.0159
  • 29. Heintzman J, Marino M, Clark K, et al. Using electronic health record data to study Latino immigrant populations in health services research. J Immigr Minor Health. 2020;22(4):754‐761. doi: 10.1007/s10903-019-00925-2
  • 30. Heintzman J, Hatch B, Coronado G, et al. Role of race/ethnicity, language, and Insurance in use of cervical cancer prevention services among low‐income Hispanic women, 2009–2013. Prev Chronic Dis. 2018;15:E25. doi: 10.5888/pcd15.170267
  • 31. Lucas JA, Marino M, Giebultowicz S, et al. Mobility and social deprivation on primary care utilisation among paediatric patients with asthma. Family Med Commun Health. 2021;9(3):e001085. doi: 10.1136/fmch-2021-001085
  • 32. Devoe JE, Gold R, Spofford M, et al. Developing a network of community health centers with a common electronic health record: description of the Safety Net West Practice‐based Research Network (SNW‐PBRN). J Am Board Family Med. 2011;24(5):597‐604. doi: 10.3122/jabfm.2011.05.110052
  • 33. Devoe JE, Sears A. The OCHIN community information network: bringing together community health centers, information technology, and data to support a patient‐centered medical village. J Am Board Fam Med. 2013;26(3):271‐278. doi: 10.3122/jabfm.2013.03.120234
  • 34. Noe‐Bustamante L, Mora L, Lopez MH. About one‐in‐four US Hispanics have heard of Latinx, but just 3% use it. Pew Res Center. 2020. Accessed November 1, 2022. https://www.pewresearch.org/hispanic/2020/08/11/about‐one‐in‐four‐u‐s‐hispanics‐have‐heard‐of‐latinx‐but‐just‐3‐use‐it/
  • 35. Hugo Lopez M, Manuel Krogstad J, Passel JS. Who is Hispanic? Pew Research Center. Accessed November 29, 2022.
  • 36. Forebears Corporation. Onograph Nationality Prediction [API]. Accessed March 14, 2022, https://forebears.io/onograph/nationality
  • 37. Forebears Corp. About Forebears Names. Accessed March 14, 2022, https://forebears.io/about/name-distribution-and-demographics
  • 38. United States Census Bureau. Frequently Occurring Surnames from the 2010 Census. Accessed March 14, 2022, https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
  • 39. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Br J Surg. 2015;102(3):148‐158. doi: 10.1002/bjs.9736
  • 40. Gelman A. Scaling regression inputs by dividing by two standard deviations. Stat Med. 2008;27(15):2865‐2873. doi: 10.1002/sim.3107
  • 41. Polley E. Super Learner. University of California; 2010. https://escholarship.org/uc/item/4qn0067v
  • 42. Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. Eur J Epidemiol. 2018;33(5):459‐464. doi: 10.1007/s10654-018-0390-z
  • 43. LeDell E, van der Laan MJ, Petersen M. AUC‐maximizing ensembles through metalearning. Int J Biostatist. 2016;12(1):203‐218. doi: 10.1515/ijb-2015-0035
  • 44. Elliott MN, Fremont A, Morrison PA, Pantoja P, Lurie N. A new method for estimating race/ethnicity and associated disparities where administrative records lack self‐reported race/ethnicity. Health Serv Res. 2008;43(5 Pt 1):1722‐1736. doi: 10.1111/j.1475-6773.2008.00854.x
  • 45. Elliott MN, Morrison PA, Fremont A, McCaffrey DF, Pantoja P, Lurie N. Using the Census Bureau's surname list to improve estimates of race/ethnicity and associated disparities. Health Serv Outcomes Res Methodol. 2009;9(2):69‐83.
  • 46. Xue Y, Harel O, Aseltine RH Jr. Imputing race and ethnic information in administrative health data. Health Serv Res. 2019;54(4):957‐963. doi: 10.1111/1475-6773.13171
  • 47. Zavez K, Harel O, Aseltine RH. Imputing race and ethnicity in healthcare claims databases. Health Serv Outcomes Res Methodol. 2022;22(4):493‐507.
  • 48. Xue Y, Harel O, Aseltine R. Comparison of imputation methods for race and ethnic information in administrative health data. IEEE. 2019;1‐4. https://ieeexplore.ieee.org/document/9030977
  • 49. Brown KS, Su Y, Jagganath J, Rayfield J, Randall M. Ethics and Empathy in Using Imputation to Disaggregate Data for Racial Equity. Accessed March 17, 2022, https://www.urban.org/sites/default/files/publication/104678/ethics‐and‐empathy‐in‐using‐imputation‐to‐disaggregate‐data‐for‐racial‐equity_0.pdf
  • 50. McCaffrey DF, Elliott MN. Power of tests for a dichotomous independent variable measured with error. Health Serv Res. 2008;43(3):1085‐1101.
  • 51. Kader F, Đoàn LN, Lee M, Chin MK, Kwon SC, Yi SS. Disaggregating race/ethnicity data categories: criticisms, dangers, and opposing viewpoints. Health Affairs Forefront. 2022. doi: 10.1377/forefront.20220323.555023

Articles from Health Services Research are provided here courtesy of Health Research & Educational Trust
