Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data

Sarah Conderino; Jasmin Divers; John A Dodson; Lorna E Thorpe; Mark G Weiner; Samrachana Adhikari

2025 May 27;60(5):e14649. doi: 10.1111/1475-6773.14649

PMCID: PMC12461102 PMID: 40421571

ABSTRACT

Objective

To compare anonymized and non‐anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)‐based datasets.

Study Setting and Design

In this New York City‐based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self‐reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR‐based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches.

Data Sources and Analytic Sample

Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020 extracted for a parent study on older adults in NYC with multiple chronic conditions.

Principal Findings

Under simulation analyses, the non‐anonymized BISG imputation provided the most accurate classification of race and ethnicity, ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κ = 0.25 for single imputation, κ = 0.25 for MICE, and κ = 0.33 for random forest). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis.

Conclusions

BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.

Keywords: Bayesian Improved Surname Geocoding, electronic health record, ethnicity, multiple imputation with chained equations, race, random forest imputation

Summary

What is known on this topic:
- In electronic health record data, race and ethnicity are often missing for a substantial proportion of patients due to challenges with data collection.
- Gold standard imputation approaches rely on identifiable information, including names and addresses that are not readily available in many research databases.

What this study adds:
- Through simulation analyses, we found that single or multiple imputation using anonymized covariates was sensitive to the missing data mechanism, with performance declining when race and ethnicity was missing not at random.
- We found that imputation of race using Bayesian Improved Surname Geocoding provides a more accurate racial and ethnic classification than single or multiple imputations using anonymized covariates.

1. Introduction

Electronic health records (EHRs) and medical claims data are increasingly being leveraged for public health research, as these data contain a wealth of clinical information on large samples of individuals who are in care. However, information on race and ethnicity is often missing for a substantial proportion of individuals in these datasets due to challenges with data collection (e.g., fear of discrimination or not identifying with standardized race or ethnicity fields) [1, 2, 3]. Measuring and improving health equity is a core function of public health, making race and ethnicity a key variable in public health research [4]. Therefore, missingness in race and ethnicity can be a significant limitation to EHR data, resulting in misleading and inconclusive findings if not handled using appropriate methods in the analyses.

Excluding individuals with missing data from a study, commonly referred to as a complete case analysis, is one of the simplest analytic approaches to handling missing data. To provide valid inferences, this approach assumes that the data are missing completely at random (MCAR), that is, that those who are excluded are a random sample of the full dataset [5]. This assumption, that the likelihood of missingness is independent of both the observed and the missing data, is unlikely to hold in healthcare datasets. Patients with missing race and ethnicity have been shown to differ systematically from those with recorded information on factors such as insurance or healthcare utilization (e.g., they are less likely to have commercial insurance) [6, 7, 8, 9]. For this reason, researchers often assume race and ethnicity are missing at random (MAR), meaning that missingness can be explained by observed data, and attempt to impute race and ethnicity using other variables available in these data [5]. If data are in fact missing not at random (MNAR), missingness depends on unobserved variables such as the value of the missing variable itself. For example, populations from certain subgroups (e.g., Hispanic/Latino) are more likely to have missing information on their race and ethnicity. When missingness is MNAR, traditional imputation methods that rely on an assumption of MAR may be insufficient and can introduce bias in the resulting statistical estimates [10].

One of the most commonly used methods for imputing race and ethnicity in EHR or claims data is Bayesian Improved Surname Geocoding (BISG) [11, 12]. Briefly, BISG predicts the probability that an individual belongs to each racial or ethnic group based on surname and residential address. These posterior probabilities are calculated using Bayesian methods, with prior probabilities informed by the racial and ethnic distribution of the individual's residential neighborhood and the likelihood informed by census lists of frequently occurring surnames. BISG has been previously validated within claims and EHR data and has been shown to perform well for most racial and ethnic groups [11, 12, 13, 14]. However, these methods rely on protected health information (PHI), such as surnames and residential addresses, that is often removed from de‐identified or limited research databases, such as the PCORnet Clinical Research Networks [15].
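The Bayesian update described above can be illustrated with a minimal Python sketch. It follows the description in this section (neighborhood composition as the prior, surname information as the likelihood) and is not the implementation used in this study. The function name `bisg_posterior` and all numeric values are hypothetical, chosen only to make the arithmetic visible.

```python
def bisg_posterior(neighborhood_prior, surname_likelihood):
    """Combine a neighborhood prior with a surname likelihood via Bayes' rule.

    neighborhood_prior: dict mapping group -> share of residents in the area.
    surname_likelihood: dict mapping group -> P(surname | group).
    Both inputs are illustrative; real BISG draws them from census tables.
    """
    unnormalized = {
        group: neighborhood_prior[group] * surname_likelihood[group]
        for group in neighborhood_prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("surname and neighborhood carry no information")
    return {group: value / total for group, value in unnormalized.items()}


# Hypothetical inputs for one patient (not taken from the study data).
prior = {"Asian": 0.10, "Black": 0.20, "Latino": 0.30, "Other": 0.05, "White": 0.35}
likelihood = {"Asian": 0.001, "Black": 0.002, "Latino": 0.020, "Other": 0.002, "White": 0.003}

posterior = bisg_posterior(prior, likelihood)
for group, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {p:.3f}")
```

In this toy case the surname is far more common among Latino individuals, so the posterior shifts toward Latino even though White is the largest group in the neighborhood prior.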

Imputation methods that have been used on anonymized or de‐identified healthcare data include single imputation or multiple imputation approaches based on other available covariates [16, 17, 18]. Single imputation replaces missing values with a single imputed value [18]. Multiple imputation approaches, on the other hand, repeatedly impute missing values and summarize observations across the multiple imputed datasets. Multiple imputation generally outperforms single imputation through the incorporation of imputation uncertainty into the estimates [18, 19]. However, some studies have observed that analyses using multiply‐imputed race and ethnicity do not meaningfully differ from complete case analyses when using anonymized covariates such as demographics and comorbidities [20]. It remains unclear whether anonymized data elements are sufficient for the imputation of race and ethnicity in clinical data sources.

This study aims to compare various methods for imputing race and ethnicity in EHR data to inform whether anonymized variables are sufficient for the imputation of race and ethnicity in descriptive studies of chronic disease burden using EHR‐based datasets. We will first use data from NYU Langone Health, where we have access to surnames and addresses, to validate the performance of various imputation approaches compared to self‐reported race and ethnicity. We will then compare the “gold standard” BISG imputation approach to single and multiple imputation approaches that are limited to anonymized data elements. These anonymized methods will then be applied to impute race and ethnicity in EHR data from the INSIGHT PCORnet Clinical Research Network.

2. Methods

2.1. Data Sources and Study Samples

This study includes two EHR‐based data sources, NYU Langone Health and the INSIGHT Clinical Research Network. NYU is a large, independent academic medical center serving the greater NYC area. Data from NYU were queried from the backend of the Epic EHR system, Clarity, which includes granular PHI such as surnames and residential addresses. INSIGHT is the largest urban Clinical Research Network in the US, capturing EHR data of over 19 million diverse patients from seven top academic healthcare systems in NYC and Houston, TX (Weill Cornell Medicine, Columbia University, Montefiore Medical Center, Mount Sinai Health System, NYU Langone Health, New York‐Presbyterian Hospital, and Houston Methodist) [21]. EHR data from these contributing institutions are translated to conform to the PCORnet common data model, which is more limited than Clarity and does not include surnames or geographies lower than Zip Codes. For this study, data from NYU and Houston Methodist were excluded from the INSIGHT data. Data from Houston Methodist were excluded as they represented a different geographic area.

This work is part of a larger study that assesses the effects of the acute COVID‐19‐associated healthcare shutdown on health outcomes among NYC older adults with multiple chronic conditions (MCC). Cohorts were constructed within each data source to include adults aged 50 years and older with two or more diagnoses for chronic medical conditions (Table S1), with specific conditions derived from the Department of Health and Human Services Initiative on MCC [22]. Cohorts were further restricted to those with at least one documented ambulatory care visit in the 6 months prior to the COVID‐19 pandemic onset (9/7/2019–3/6/2020).

2.2. Variables

A consolidated race and ethnicity variable was defined to harmonize definitions across NYU and INSIGHT and to align with prior EHR‐based research [23]. Within NYU, self‐reported race and ethnicity was defined using distinct race and ethnicity fields in the Epic Clarity database. Race and ethnicity are collected at front desk registration, at self‐check‐in kiosks, or through patient portals (i.e., MyChart). These fields allow for multiple entries of racial or ethnic categories. The consolidated race and ethnicity variable was defined using the following hierarchy: (1) Latino—any Hispanic/Latino/Spanish race or ethnicity; (2) White—White race alone; (3) Black—Black/African American race alone; (4) Asian—Asian, Native Hawaiian/Pacific Islander race alone; (5) Other—multiple races, American Indian/Alaskan Native race, Middle Eastern/North African race, not listed; (6) Unknown—do not know, prefer not to answer, missing. Asian and Other categories include groupings of multiple racial and ethnic subgroups to align with BISG predictions.
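The NYU hierarchy above can be read as an ordered set of rules, sketched here in Python. The field values (for example `"hispanic"` or `"native hawaiian/pacific islander"`) are hypothetical labels, not the actual Epic Clarity codes, and treating "race alone" as exactly one recorded race is an assumption about how the rule was applied.

```python
# Hypothetical response labels; the real Clarity field values will differ.
NON_RESPONSES = {"do not know", "prefer not to answer"}
HISPANIC = {"hispanic", "latino", "spanish"}
ASIAN_GROUP = {"asian", "native hawaiian/pacific islander"}


def consolidate(races, ethnicities):
    """Apply the six-level hierarchy to sets of recorded race/ethnicity entries."""
    races = set(races) - NON_RESPONSES
    ethnicities = set(ethnicities) - NON_RESPONSES
    # (1) Any Hispanic/Latino/Spanish entry takes precedence.
    if (races | ethnicities) & HISPANIC:
        return "Latino"
    # (6) Nothing informative recorded.
    if not races:
        return "Unknown"
    # (2)-(4) Single-race categories ("race alone", assumed to mean one entry).
    if races == {"white"}:
        return "White"
    if races == {"black"}:
        return "Black"
    if len(races) == 1 and races <= ASIAN_GROUP:
        return "Asian"
    # (5) Multiple races and all remaining categories.
    return "Other"


print(consolidate({"white"}, {"hispanic"}))
print(consolidate({"white", "black"}, set()))
print(consolidate({"prefer not to answer"}, set()))
```

Checking the Latino rule first is what makes the variable a hierarchy: a patient recorded as both White and Hispanic is counted once, as Latino.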

Within INSIGHT, self‐reported race and ethnicity was defined using the distinct race and Hispanic fields in the PCORnet common data model. These fields allow for single entries of racial or Hispanic ethnicity categories. The consolidated race and ethnicity variable was defined using the following hierarchy: (1) Latino—Hispanic ethnicity; (2) White—White race; (3) Black—Black/African American race; (4) Asian—Asian, Native Hawaiian/Other Pacific Islander race; (5) Other—multiple race, American Indian/Alaskan Native race, other; (6) Unknown—unknown, refuse to answer, no information, missing.

Definitions for the covariates that were used for the imputation approaches are included in Table S2. Briefly, non‐anonymized covariates that were only available in the NYU cohort included patient surname and census block of residence. Anonymized covariates included both individual and census tract‐level variables that could be defined consistently across data sources and were hypothesized to be associated with race and ethnicity or missingness. Individual‐level covariates included sex, age, chronic condition comorbidities, vital measurements, and healthcare utilization indicators. Census tract‐level covariates were based on the Agency for Healthcare Research and Quality 2020 Social Determinants of Health Database [24]. Smoking status was defined for both NYU and INSIGHT but was ultimately not included in the INSIGHT analyses due to high levels of missing data (50.5%). Individuals who had missing data for any anonymized covariates besides race and ethnicity were excluded from all analyses.

2.3. Missing Data Simulations

In order to validate the imputation approaches compared to self‐reported race and ethnicity, we introduced missingness for 15% of those with known race and ethnicity (i.e., complete cases) within the NYU cohort under the three missing data mechanisms, MCAR, MAR, and MNAR. Under MCAR, race and ethnicity was set to missing on a 15% random sample of complete cases. Under MAR, we first predicted the probability of having an unknown race and ethnicity in the full NYU cohort using a logistic regression model including anonymized covariates (described above). We then set race and ethnicity to missing for a random sample of complete cases using predicted probabilities for missingness as sampling probabilities. Under MNAR, we set race and ethnicity to missing on a stratified random sample of complete cases (16% Black, 25% Latino, 17% Asian, 17% Other, and 25% White) so that missingness was dependent only on race and ethnicity itself. Simulations were repeated 100 times and results were summarized using means and 95% confidence intervals.
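The three masking schemes can be sketched in Python as follows. This is an illustration under stated assumptions, not the study code: the MNAR percentages are read as the composition of the masked sample, the weighted sampling for MAR uses the Efraimidis-Spirakis key method (the study does not say which sampler was used), and the toy cohort and missingness probabilities are invented.

```python
import random


def simulate_mcar(n_records, rate, rng):
    """MCAR: mask a simple random sample of complete cases."""
    return set(rng.sample(range(n_records), round(rate * n_records)))


def simulate_mar(p_missing, rate, rng):
    """MAR: sample without replacement, weighted by predicted P(missing).

    p_missing would come from a logistic model on observed covariates.
    Uses Efraimidis-Spirakis keys u ** (1 / w); the largest keys are kept.
    """
    n_mask = round(rate * len(p_missing))
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(p_missing) if w > 0]
    keys.sort(reverse=True)
    return {i for _, i in keys[:n_mask]}


def simulate_mnar(races, rate, shares, rng):
    """MNAR: stratified sample whose composition depends only on race itself."""
    n_mask = round(rate * len(races))
    masked = set()
    for group, share in shares.items():
        members = [i for i, r in enumerate(races) if r == group]
        masked.update(rng.sample(members, min(len(members), round(share * n_mask))))
    return masked


rng = random.Random(2020)
# Toy cohort of 2,000 complete cases (hypothetical composition).
races = ["White"] * 1200 + ["Black"] * 300 + ["Latino"] * 200 + ["Asian"] * 150 + ["Other"] * 150
p_missing = [rng.uniform(0.01, 0.30) for _ in races]  # stand-in for model predictions
shares = {"Black": 0.16, "Latino": 0.25, "Asian": 0.17, "Other": 0.17, "White": 0.25}

mcar = simulate_mcar(len(races), 0.15, rng)
mar = simulate_mar(p_missing, 0.15, rng)
mnar = simulate_mnar(races, 0.15, shares, rng)
print(len(mcar), len(mar), len(mnar))
```

Under this reading of MNAR, Latino patients make up 25% of the masked sample but only 10% of the toy cohort, which is what makes missingness depend on the unobserved value.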

2.4. Imputation Approaches

Four imputation approaches were considered in these analyses: BISG, single imputation with neighborhood majority, random forest imputation, and multivariate imputation with chained equations (MICE). Random forest is an ensemble machine learning approach that constructs multiple classification or regression decision trees [25]. MICE is a sequential modeling approach that allows you to run separate, appropriately specified regression models for each variable to be imputed [26].

BISG was only conducted on the simulated and NYU cohorts, where surnames and census blocks were available. We implemented BISG following previously published methods using the wru package in R and the non‐anonymized covariates of surname and census block of residence [27]. A cutoff of 0.5 was used to classify mean posterior predicted probabilities into a given race and ethnicity category. Any individuals with all predicted probabilities below 0.5 were classified as “Unknown” [28]. As a sensitivity analysis, we tested two modifications to the BISG classification. First, we used the maximum predicted probability with no threshold. Second, we assigned race and ethnicity by sampling from a multinomial distribution using BISG predicted probabilities as sampling weights.
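The main classification rule and the two sensitivity-analysis variants differ only in how a vector of predicted probabilities is turned into a single category. A minimal Python sketch follows; the function names and example probabilities are hypothetical and the code does not call the wru package.

```python
import random


def classify_threshold(probs, cutoff=0.5):
    """Main rule: assign the top category only if it reaches the cutoff."""
    group, p = max(probs.items(), key=lambda kv: kv[1])
    return group if p >= cutoff else "Unknown"


def classify_max(probs):
    """Sensitivity analysis 1: maximum predicted probability, no threshold."""
    return max(probs, key=probs.get)


def classify_sampled(probs, rng):
    """Sensitivity analysis 2: draw one category with probabilities as weights."""
    groups = list(probs)
    return rng.choices(groups, weights=[probs[g] for g in groups], k=1)[0]


# Hypothetical predicted probabilities for two patients.
confident = {"Asian": 0.02, "Black": 0.05, "Latino": 0.78, "Other": 0.02, "White": 0.13}
diffuse = {"Asian": 0.10, "Black": 0.20, "Latino": 0.30, "Other": 0.05, "White": 0.35}

rng = random.Random(1)
for probs in (confident, diffuse):
    print(classify_threshold(probs), classify_max(probs), classify_sampled(probs, rng))
```

The second patient shows the trade-off: the threshold rule leaves the record as "Unknown", whereas the maximum rule always assigns a category, here on the strength of a 0.35 probability.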

The remaining three anonymized imputation approaches were conducted on the simulated, NYU, and INSIGHT cohorts. Single imputation with neighborhood majority represented substituting missing or unknown race and ethnicity values with the majority race and ethnicity group from the census tract‐level covariates. Majority was defined as having at least 50% of residents identifying as the given race and ethnicity, with tracts with all categories below 50% classified as “Unknown.” As a sensitivity analysis, we assigned the neighborhood plurality, or most common race and ethnicity group from the census tract‐level covariates. We conducted random forest imputation with the missForest package in R, which uses a random forest algorithm to predict missing values based on observed values within the dataset [29]. All anonymized covariates were included in the imputation model, and we specified 100 trees and a maximum of 10 iterations. We conducted MICE imputation using the mice package in R, specifying 15 imputations, a maximum of 25 iterations, and a polytomous regression model specification [30]. All anonymized covariates were included as predictors.

2.5. Assessment of Imputation Performance

In the NYU cohort with complete race and ethnicity, we validated imputation approaches among the subset of patients that were simulated to have a missing self‐reported race and ethnicity for varying simulated missingness rates and mechanisms. For each racial and ethnic category, we compared imputed to observed self‐reported race and ethnicity using sensitivity (# correctly classified in racial and ethnic category i / # self‐reported i) and precision (# correctly classified in racial and ethnic category i / # classified i). We then calculated the overall accuracy and agreement with self‐reported race and ethnicity using Cohen's kappa.
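For concreteness, the four performance measures can be computed from paired lists of self-reported and imputed categories as in the Python sketch below. The ten-patient example is invented for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def sensitivity_precision(truth, imputed, category):
    """Sensitivity = correct / self-reported; precision = correct / classified."""
    correct = sum(1 for t, p in zip(truth, imputed) if t == category and p == category)
    n_true = sum(1 for t in truth if t == category)
    n_pred = sum(1 for p in imputed if p == category)
    sens = correct / n_true if n_true else float("nan")
    prec = correct / n_pred if n_pred else float("nan")
    return sens, prec


def accuracy(truth, imputed):
    """Share of patients whose imputed category matches self-report."""
    return sum(t == p for t, p in zip(truth, imputed)) / len(truth)


def cohens_kappa(truth, imputed):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(truth, imputed)
    t_counts, p_counts = Counter(truth), Counter(imputed)
    p_e = sum(t_counts[c] * p_counts[c] for c in set(t_counts) | set(p_counts)) / n**2
    return (p_o - p_e) / (1 - p_e)


# Toy example: two of four Latino patients are misclassified as White.
truth = ["White"] * 6 + ["Latino"] * 4
imputed = ["White"] * 6 + ["White", "White", "Latino", "Latino"]

sens, prec = sensitivity_precision(truth, imputed, "Latino")
print(sens, prec, accuracy(truth, imputed), round(cohens_kappa(truth, imputed), 3))
```

The example mirrors the pattern reported in the Results: misclassifying a minority group into the majority lowers that group's sensitivity while leaving its precision high, and kappa falls well below raw accuracy because much of the agreement is expected by chance.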

Within the full NYU cohort, we assessed the agreement between the anonymized imputation approaches and the gold standard imputation approach of BISG among the subset of patients with an unknown self‐reported race and ethnicity using Cohen's kappa. Within the full NYU and INSIGHT cohorts, we characterized the distribution of patients by self‐reported and imputed race and ethnicity using different anonymized imputation strategies. We then estimated the burden of common (hypertension and hyperlipidemia) and rare (lung cancer, colorectal cancer) health outcomes by self‐reported and imputed race and ethnicity using logistic regression models for each outcome. Rubin's rule, which pools estimates from multiple imputed data while incorporating both within‐ and between‐imputation variance, was used to combine estimates and construct 95% confidence intervals under MICE [31].
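The pooling step can be sketched as follows in Python. The inputs are hypothetical per-imputation estimates and variances, and the confidence interval uses a normal approximation for simplicity, which is an assumption; standard implementations of Rubin's rules use a t reference distribution with adjusted degrees of freedom.

```python
import math


def pool_rubin(estimates, variances, z=1.96):
    """Pool m per-imputation estimates and their variances.

    Returns the pooled estimate, total variance, and an approximate 95% CI.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled point estimate
    within = sum(variances) / m                              # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                   # total variance
    half_width = z * math.sqrt(total)
    return q_bar, total, (q_bar - half_width, q_bar + half_width)


# Hypothetical estimates from three imputed datasets (not study results).
estimates = [0.10, 0.12, 0.14]
variances = [0.0004, 0.0004, 0.0004]

pooled, total_var, ci = pool_rubin(estimates, variances)
print(round(pooled, 4), round(total_var, 6), tuple(round(x, 4) for x in ci))
```

The between-imputation term is what separates multiple from single imputation: when imputed datasets disagree, the total variance grows and the interval widens accordingly.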

3. Results

In total, there were 139,356 patients in the NYU cohort and 297,874 patients in the INSIGHT cohort. Only 4.9% of the NYU cohort had an unknown self‐reported race and ethnicity, whereas 16.7% of the INSIGHT cohort had an unknown race and ethnicity. INSIGHT had a larger proportion of Latino patients (19.3% vs. 9.9%) and a smaller proportion of White patients (42.6% vs. 61.0%) compared with NYU. Age and sex distributions, chronic condition prevalence, and mean vitals were similar across cohorts (Table 1). INSIGHT patients had a higher average number of ambulatory care encounters (18.71 vs. 14.84) but slightly lower average number of emergency department (0.81 vs. 1.96) or inpatient encounters (0.46 vs. 1.70) compared with NYU patients. The INSIGHT cohort had a higher mean neighborhood proportion of Latino residents (28.3% vs. 18.7%) and a lower proportion of White residents (39.9% vs. 48.1%) than the NYU cohort.

TABLE 1.

Characteristics of the NYU Langone and INSIGHT study cohorts. ^a

	NYU Langone (N = 139,356)	NYC INSIGHT (N = 297,874)
Individual‐level covariates
Self‐reported race/ethnicity
Asian	7129 (5.1)	11,430 (3.8)
Black	17,415 (12.5)	47,301 (15.9)
Latino	13,833 (9.9)	57,537 (19.3)
Other	9134 (6.6)	5038 (1.7)
White	84,962 (61.0)	126,854 (42.6)
Unknown/missing	6884 (4.9)	49,714 (16.7)
Age (mean)	69.99 (10.66)	69.32 (10.45)
Sex—male	58,368 (41.9)	121,586 (40.8)
Smoking status ^b
Current	9241 (6.6)	16,542 (5.6)
Former	49,333 (35.4)	49,849 (16.7)
Never	80,652 (57.9)	80,994 (27.2)
Unknown	130 (0.1)	150,489 (50.5)
Hyperlipidemia	106,646 (76.5)	218,866 (73.5)
Diabetes	44,567 (32.0)	106,122 (35.6)
Atrial fibrillation	24,923 (17.9)	56,851 (19.1)
Chronic kidney disease	17,125 (12.3)	49,348 (16.6)
Breast cancer	9298 (6.7)	17,540 (5.9)
Osteoporosis	21,872 (15.7)	41,624 (14.0)
Stroke	10,370 (7.4)	22,104 (7.4)
Hypertension	106,490 (76.4)	234,367 (78.7)
Arthritis	53,667 (38.5)	126,823 (42.6)
COPD	22,549 (16.2)	44,236 (14.9)
Ischemic heart disease	40,371 (29.0)	86,055 (28.9)
Heart failure	12,854 (9.2)	33,332 (11.2)
Asthma	18,203 (13.1)	45,020 (15.1)
Depression	21,127 (15.2)	50,670 (17.0)
Prostate cancer	5642 (4.0)	11,189 (3.8)
Lung cancer	2619 (1.9)	3528 (1.2)
Colorectal cancer	2691 (1.9)	4927 (1.7)
Height (mean)	63.77 (4.98)	65.03 (4.04)
Weight (mean)	174.39 (42.82)	174.81 (43.12)
Systolic blood pressure (mean)	128.72 (16.90)	130.92 (17.69)
Diastolic blood pressure (mean)	74.22 (10.02)	75.44 (9.83)
Ambulatory encounters (mean)	14.84 (14.55)	18.71 (18.44)
Emergency department encounters (mean)	1.96 (2.49)	0.81 (2.70)
Inpatient encounters (mean)	1.70 (1.51)	0.46 (1.25)
Neighborhood‐level covariates
Limited English proficiency (mean)	14.99 (14.85)	13.21 (12.49)
Unemployment (mean)	5.58 (3.89)	6.58 (4.55)
Median household income (mean)	85,421 (40,411)	82,623 (46,490)
Poverty (mean)	11.73 (8.91)	13.95 (10.66)
Medicaid insured (mean)	22.70 (16.53)	24.96 (18.20)
Uninsured (mean)	5.66 (4.31)	5.88 (4.34)
Less than high school education (mean)	13.38 (10.60)	15.54 (11.92)
Foreign born (mean)	37.12 (15.92)	34.13 (14.69)
Distance to ED (mean)	0.95 (0.70)	1.02 (1.06)
Race/ethnicity
American Indian or Alaskan Native (mean)	0.14 (0.60)	0.15 (0.72)
Asian (mean)	16.30 (15.87)	11.59 (13.42)
Black (mean)	13.46 (23.24)	16.98 (22.39)
Latino (mean)	18.67 (16.42)	28.27 (23.55)
Multiple races (mean)	2.71 (2.59)	2.39 (2.37)
Native Hawaiian or Pacific Islander (mean)	0.04 (0.31)	0.03 (0.25)
Other (mean)	0.64 (1.74)	0.68 (1.69)
White (mean)	48.08 (28.01)	39.91 (30.52)

^a Counts with percentages in parentheses are presented for categorical variables; means with standard deviations in parentheses are presented for continuous variables.

^b Smoking status was missing or unknown for 50.5% of the INSIGHT cohort and was not included as an anonymized covariate for the random forest or MICE imputation approaches.

Figure 1 presents the sensitivity and precision of the different imputation approaches under the three missingness mechanism simulation scenarios in NYU. Under all simulation scenarios, BISG had high sensitivity for Asian, Black, Latino, and White racial and ethnic categories, meaning that a high proportion of those who self‐identified as these races were accurately classified through the surname‐based imputation approach. The anonymized imputation methods had higher sensitivity for White and Black but lower sensitivity for Asian and Latino racial and ethnic categories, with many of those who self‐reported Asian or Latino misclassified as White. Random forest had higher sensitivity for White, Black, and Latino racial and ethnic categories compared to the other anonymized imputation methods under all simulation scenarios. Under the MCAR and MAR scenarios, precision was lower for Asian and Latino than for White or Black for all imputation methods, meaning that a low proportion of those who were classified as Asian or Latino self‐identified in that racial and ethnic category. Under MNAR, precision improved for the Asian and Latino categories as the overall proportion of individuals self‐identifying as these categories increased (Figure S1). Other race and ethnic categories had low sensitivity and precision for all imputation methods and under all simulation scenarios.

FIGURE 1.

Sensitivity and precision of imputation approaches by racial/ethnic category and simulation scenario. Each panel in this figure displays the mean sensitivity or precision of the imputation approaches across 100 simulations per simulation scenario. Panels are separated into columns by simulation scenario (MCAR, missing completely at random; MAR, missing at random; MNAR, missing not at random) and into rows by metric (sensitivity and precision). The color and shape of each point correspond to the imputation method (teal square: BISG, Bayesian Improved Surname Geocoding; yellow circle: Single, single imputation using neighborhood majority; orange triangle: random forest, random forest imputation using all anonymized covariates; blue diamond: MICE, multivariate imputation with chained equations using all anonymized covariates).

BISG had the highest overall accuracy and agreement with self‐reported race and ethnicity across the three simulation scenarios, with accuracy at approximately 70% and agreement at approximately 0.6 (Table 2). The sensitivity analysis using BISG predicted probabilities as sampling weights had slightly lower accuracy and agreement than using the 50% threshold for BISG classifications, with accuracy around 65% and agreement around 0.5 across simulation scenarios (Table S4). Sensitivity analyses using (1) the maximum BISG predicted probability and (2) neighborhood plurality led to no patients being classified as having an unknown race and ethnicity. These definitions had slightly higher overall accuracy and agreement compared to the main definitions that incorporated 50% thresholds (Table S4). However, precisions for the given racial or ethnic categories decreased slightly under these sensitivity analyses (Figure S2). Random forest imputation had comparable overall accuracy and agreement to BISG under MCAR and MAR, but performance worsened under MNAR (accuracy_MNAR = 48%; kappa_MNAR = 0.33). Single imputation with the neighborhood majority and MICE had the lowest levels of accuracy and agreement with self‐reported race and ethnicity across simulation scenarios. This was driven by the misclassification of those who self‐reported Other, Asian, or Latino race and ethnicity as White. When comparing the anonymized imputation approaches to BISG as a gold standard in the NYU cohort, all approaches had fair agreement with BISG [32]. Random forest imputation had the highest agreement (kappa = 0.40), followed by single imputation with the neighborhood majority (kappa = 0.37), and MICE (kappa = 0.26).

TABLE 2.

Overall accuracy and agreement ^a of imputation approaches under the simulation scenarios ^b .

Method ^c	MCAR: Accuracy (%)	MCAR: Agreement (κ)	MAR: Accuracy (%)	MAR: Agreement (κ)	MNAR: Accuracy (%)	MNAR: Agreement (κ)
BISG	72.8 (72.7–72.9)	0.560 (0.560–0.561)	70.9 (70.9–71.0)	0.552 (0.552–0.553)	66.0 (66.0–66.1)	0.575 (0.575–0.576)
Single	56.6 (56.6–56.7)	0.310 (0.310–0.311)	53.6 (53.6–53.7)	0.305 (0.304–0.305)	37.0 (37.0–37.1)	0.248 (0.247–0.248)
Random forest	74.5 (74.4–74.5)	0.503 (0.503–0.504)	72.4 (72.3–72.4)	0.494 (0.493–0.494)	48.4 (48.3–48.4)	0.333 (0.333–0.334)
MICE	53.4 (53.4–53.5)	0.262 (0.262–0.263)	51.9 (51.8–52.0)	0.261 (0.260–0.261)	42.1 (42.0–42.2)	0.254 (0.253–0.255)

^a Presenting mean (and 95% confidence intervals) across 100 simulations per simulation scenario.

^b Scenario: MCAR, missing completely at random simulation scenario; MAR, missing at random simulation scenario; MNAR, missing not at random simulation scenario.

^c Method: BISG, Bayesian Improved Surname Geocoding; Single, single imputation using neighborhood majority; random forest, random forest imputation using all anonymized covariates; MICE, multivariate imputation with chained equations using all anonymized covariates.

Figure 2 presents the racial and ethnic distributions of the NYU and INSIGHT cohorts by self‐report or imputation method. BISG and single imputation with neighborhood majority resulted in a small proportion of individuals who were classified as having an unknown race and ethnicity (NYU_BISG: 4.9%, NYU_Single: 4.9%, INSIGHT_Single: 3.7%), whereas random forest and MICE classified all patients into known racial and ethnic categories. When those with unknown race and ethnicity are excluded, all methods resulted in racial and ethnic distributions that were comparable to the self‐reported distribution for both cohorts (Figure S3). Within both cohorts, the burden of the chronic conditions of interest was also consistent across imputation approaches by race and ethnicity (Table 3). Compared to the complete case analysis, imputation resulted in minor improvements in the precision of chronic disease burden, as measured by the width of the 95% confidence intervals, especially for smaller racial and ethnic groups.

FIGURE 2.

Racial/ethnic distribution by imputation method in NYU Langone and INSIGHT cohorts. Each panel in this figure displays the proportion of patients classified into each racial and ethnic category by imputation method. Panel (A) corresponds to the NYU Langone cohort and Panel (B) corresponds to the INSIGHT cohort. The color and shape of each point correspond to the imputation method (red cross: Self, self‐reported; teal square: BISG, Bayesian Improved Surname Geocoding; yellow circle: Single, single imputation using neighborhood majority; orange triangle: random forest, random forest imputation using all anonymized covariates; blue diamond: MICE, multivariate imputation with chained equations using all anonymized covariates).

TABLE 3.

Race/ethnicity‐specific estimates of chronic disease burden ^a in the NYU Langone and INSIGHT cohorts by imputation approach ^b .

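Disease and race/ethnicity	NYU Langone: Self (%) ^c	NYU Langone: BISG (%)	NYU Langone: Single (%)	NYU Langone: Random forest (%)	NYU Langone: MICE (%)	NYC INSIGHT: Self (%) ^c	NYC INSIGHT: Single (%)	NYC INSIGHT: Random forest (%)	NYC INSIGHT: MICE (%)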
Hypertension
Asian	75.96 (74.98–76.94)	75.94 (75–76.87)	75.55 (74.6–76.5)	75.04 (74.08–75.99)	75.96 (74.98–76.94)	77.91 (77.16–78.65)	78.02 (77.32–78.72)	78.07 (77.37–78.77)	78.72 (78.02–79.43)
Black	85.62 (84.45–86.78)	85.39 (84.28–86.51)	85.38 (84.26–86.51)	85.38 (84.23–86.52)	85.62 (84.45–86.78)	87.32 (86.49–88.15)	86.79 (86.01–87.57)	86.87 (86.09–87.64)	87 (86.22–87.79)
Latino	78.62 (77.41–79.82)	78.34 (77.19–79.49)	78.54 (77.36–79.71)	78.48 (77.28–79.68)	78.62 (77.41–79.82)	82.67 (81.86–83.49)	82.54 (81.78–83.3)	82.67 (81.91–83.43)	82.2 (81.43–82.97)
Other	75.1 (73.8–76.41)	75.1 (73.82–76.37)	75.1 (73.82–76.39)	74.99 (73.71–76.28)	75.1 (73.8–76.41)	82.29 (80.95–83.64)	82.29 (80.97–83.62)	82.29 (80.99–83.6)	82.47 (81.17–83.77)
White	74.54 (73.52–75.56)	74.39 (73.42–75.37)	74.42 (73.44–75.41)	74.45 (73.45–75.45)	74.54 (73.52–75.56)	73.48 (72.7–74.26)	73.75 (73.02–74.47)	73.66 (72.93–74.38)	73.44 (72.7–74.19)
Hyperlipidemia
Asian	76.81 (75.83–77.79)	76.57 (75.64–77.51)	76.91 (75.96–77.85)	76.7 (75.75–77.65)	76.81 (75.83–77.79)	76.71 (75.91–77.52)	76.83 (76.08–77.59)	77.08 (76.32–77.83)	77.22 (76.39–78.04)
Black	68.17 (67–69.33)	68.19 (67.08–69.31)	68.19 (67.06–69.31)	68.21 (67.08–69.33)	68.17 (67–69.33)	66.21 (65.31–67.11)	66.7 (65.86–67.55)	66.68 (65.84–67.52)	66.69 (65.8–67.59)
Latino	74.73 (73.52–75.93)	74.58 (73.43–75.73)	74.48 (73.31–75.65)	74.58 (73.39–75.77)	74.73 (73.52–75.93)	72.26 (71.38–73.14)	72.07 (71.25–72.9)	72.41 (71.59–73.23)	71.79 (70.88–72.69)
Other	76.86 (75.55–78.16)	76.83 (75.56–78.11)	76.86 (75.57–78.14)	76.83 (75.53–78.12)	76.86 (75.55–78.16)	72.05 (70.6–73.51)	72.05 (70.62–73.48)	72.39 (70.98–73.8)	72.82 (71.37–74.26)
White	78.6 (77.58–79.62)	78.55 (77.58–79.53)	78.56 (77.58–79.55)	78.57 (77.57–79.57)	78.6 (77.58–79.62)	76.66 (75.82–77.5)	76.26 (75.47–77.05)	76.26 (75.47–77.05)	76.66 (75.79–77.53)
Lung cancer
Asian	3.24 (2.92–3.56)	2.99 (2.69–3.29)	3.11 (2.8–3.41)	3.04 (2.74–3.34)	3.24 (2.92–3.56)	1.95 (1.75–2.15)	1.82 (1.63–2)	1.79 (1.6–1.97)	1.78 (1.6–1.96)
Black	1.29 (0.91–1.67)	1.26 (0.9–1.61)	1.25 (0.89–1.62)	1.24 (0.88–1.6)	1.29 (0.91–1.67)	0.89 (0.67–1.12)	0.89 (0.68–1.1)	0.88 (0.67–1.08)	0.85 (0.65–1.05)
Latino	1.03 (0.63–1.42)	0.98 (0.61–1.35)	1.04 (0.66–1.42)	0.98 (0.6–1.35)	1.03 (0.63–1.42)	0.73 (0.51–0.95)	0.75 (0.54–0.95)	0.74 (0.53–0.94)	0.71 (0.51–0.91)
Other	1.28 (0.85–1.71)	1.28 (0.87–1.69)	1.28 (0.87–1.69)	1.22 (0.81–1.62)	1.28 (0.85–1.71)	1.01 (0.64–1.38)	1.01 (0.66–1.37)	1.04 (0.69–1.39)	0.99 (0.65–1.33)
White	2.18 (1.84–2.51)	2.14 (1.83–2.45)	2.13 (1.81–2.45)	2.13 (1.82–2.45)	2.18 (1.84–2.51)	1.56 (1.35–1.77)	1.47 (1.28–1.67)	1.46 (1.27–1.66)	1.5 (1.31–1.7)
Colorectal cancer
Asian	1.87 (1.54–2.19)	1.77 (1.46–2.07)	1.83 (1.52–2.14)	1.77 (1.46–2.08)	1.87 (1.54–2.19)	1.69 (1.45–1.93)	1.66 (1.44–1.88)	1.61 (1.39–1.83)	1.63 (1.4–1.85)
Black	1.67 (1.28–2.05)	1.59 (1.23–1.96)	1.61 (1.24–1.98)	1.58 (1.22–1.95)	1.67 (1.28–2.05)	1.63 (1.36–1.89)	1.58 (1.34–1.83)	1.54 (1.3–1.78)	1.58 (1.33–1.83)
Latino	1.67 (1.27–2.07)	1.58 (1.2–1.95)	1.62 (1.24–2)	1.6 (1.22–1.98)	1.67 (1.27–2.07)	1.59 (1.33–1.85)	1.57 (1.33–1.81)	1.6 (1.36–1.83)	1.54 (1.29–1.78)
Other	1.65 (1.22–2.09)	1.65 (1.23–2.06)	1.65 (1.23–2.07)	1.58 (1.17–2)	1.65 (1.22–2.09)	1.59 (1.16–2.02)	1.59 (1.17–2)	1.56 (1.15–1.97)	1.57 (1.16–1.98)
White	2.15 (1.81–2.49)	2.12 (1.8–2.43)	2.11 (1.79–2.43)	2.11 (1.79–2.43)	2.15 (1.81–2.49)	1.81 (1.56–2.06)	1.74 (1.51–1.97)	1.73 (1.5–1.96)	1.75 (1.52–1.99)

^a Presenting the percent of individuals with the chronic condition, with 95% confidence intervals in parentheses.

^b Method: Self, self‐reported race/ethnicity; BISG, Bayesian Improved Surname Geocoding; Single, single imputation using neighborhood majority; random forest, random forest imputation using all anonymized covariates; MICE, multivariate imputation with chained equations using all anonymized covariates.

^c Sample sizes by self‐reported race and ethnicity: NYU Langone—Asian: 7129; Black: 17,415; Latino: 13,833; Other: 9134; White: 84,962. INSIGHT—Asian: 11,430; Black: 47,301; Latino: 57,537; Other: 5038; White: 126,854.

4. Discussion

In this study, we assessed the performance of various imputation methods using raw EHR data from an academic medical center and an EHR‐based research database from a PCORnet Clinical Research Network. Raw EHR data from NYU contained more granular PHI, including surnames and census blocks of residence, allowing us to conduct BISG imputation. INSIGHT did not include these variables, so methods were limited to single and multiple imputation using anonymized covariates. In both cohorts, all imputation approaches yielded racial and ethnic distributions and chronic disease burden estimates comparable to those obtained using self‐reported race and ethnicity. This suggests that when characterizing disparities in disease burden by race and ethnicity, results from a complete case analysis may not meaningfully differ from results obtained using traditional approaches to impute race and ethnicity.

In simulation analyses where we validated imputation approaches against self‐report, BISG had the highest overall accuracy and agreement, and this was consistent across simulation scenarios. Overall BISG accuracy and agreement improved slightly in sensitivity analyses that assigned racial and ethnic category based on the maximum predicted probability with no thresholds, because no individuals were classified as having an unknown race or ethnicity. However, specificities for given racial or ethnic categories decreased slightly, suggesting that individuals with predicted probabilities below 50% were more likely to be falsely classified in the given racial and ethnic categories. Sensitivity was high for all racial and ethnic categories besides Other, and precision was generally high for White and Black subgroups but was lower for Asian and Latino subgroups. Differential performance by race and ethnicity has also been shown in prior research [28, 33]. BISG leverages the census list of surnames that occur 100 or more times in the 2020 census and the census's list of approximately 12,000 common Spanish surnames. Surnames from immigrant or non‐White populations may be more likely to be missing from these lists, which could affect the overall performance of BISG when applied to more diverse populations, like NYC residents. In fact, 18% of the NYU cohort did not match the surname lists, which is substantially higher than what has been reported by other studies [33, 34]. Methods for improving performance, such as supplementing census lists with additional surnames or incorporating first or middle names into the predictive models, could be considered when applying BISG to diverse populations [33, 34, 35].
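The trade‐off between thresholded and maximum‐probability classification described above can be illustrated with a minimal sketch. The 0.5 cutoff, the function name `classify`, and the probability values below are assumptions chosen for illustration; they are not taken from the study's code or data.

```python
from typing import Dict, Optional

CATEGORIES = ("Asian", "Black", "Latino", "Other", "White")


def classify(probs: Dict[str, float], threshold: Optional[float] = 0.5) -> str:
    """Assign the category with the highest predicted probability.

    If a threshold is given and the highest probability falls below it,
    the individual is left as "Unknown" instead of being classified.
    """
    best = max(CATEGORIES, key=lambda c: probs.get(c, 0.0))
    if threshold is not None and probs.get(best, 0.0) < threshold:
        return "Unknown"
    return best


# Hypothetical predicted probabilities for two individuals (made-up values).
confident = {"Asian": 0.02, "Black": 0.05, "Latino": 0.85, "Other": 0.03, "White": 0.05}
ambiguous = {"Asian": 0.10, "Black": 0.35, "Latino": 0.10, "Other": 0.05, "White": 0.40}

print(classify(confident))        # thresholded rule
print(classify(ambiguous))        # thresholded rule: no category reaches 0.5
print(classify(ambiguous, None))  # maximum-probability rule, no threshold
```

Under the thresholded rule the ambiguous individual remains unclassified, whereas the no‐threshold rule always assigns a category, which removes unknowns but admits lower‐confidence classifications.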

Of the anonymized methods, random forest imputation appeared to have better accuracy and agreement with self‐reported race and ethnicity than single imputation with neighborhood majority or MICE. Random forest imputation had comparable performance to BISG under the MCAR and MAR simulation scenarios and had the highest level of agreement with BISG imputation in the NYU cohort. The accuracy of MICE was slightly better than single imputation with neighborhood majority. As expected, accuracy and agreement between all anonymized imputation approaches and self‐reported race and ethnicity were lowest under the MNAR simulation scenario. These results align with another simulation study using data from an EHR‐based research database, which found that random forest imputation more accurately classified categorical variables than MICE methods [25]. However, when these researchers analyzed performance based on coverage of confidence intervals for hazard ratio estimates, multiple imputation methods outperformed single random forest imputation with missForest [25]. The observed performance of imputation approaches may be sensitive to the selection of validation metrics. Similar to prior studies focusing on the imputation of race and ethnicity, we prioritized metrics that assessed misclassification over metrics assessing parameter bias, because we aim to use these data in numerous future analyses [36].
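For readers who wish to reproduce misclassification‐based metrics of this kind, the following is a minimal sketch of overall accuracy, Cohen's kappa, and per‐category sensitivity and precision. The function names and the toy labels are illustrative assumptions and are not drawn from the study data.

```python
from collections import Counter
from typing import Sequence, Tuple


def accuracy(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Proportion of individuals whose imputed category matches self-report."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohens_kappa(truth: Sequence[str], pred: Sequence[str]) -> float:
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(truth, pred)
    truth_counts, pred_counts = Counter(truth), Counter(pred)
    # Expected agreement if the two labelings were independent.
    p_e = sum(truth_counts[c] * pred_counts[c] for c in truth_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1


def sensitivity_precision(
    truth: Sequence[str], pred: Sequence[str], category: str
) -> Tuple[float, float]:
    """Sensitivity = TP / self-reported in category; precision = TP / imputed in category."""
    tp = sum(t == category and p == category for t, p in zip(truth, pred))
    actual = sum(t == category for t in truth)
    predicted = sum(p == category for p in pred)
    sens = tp / actual if actual else float("nan")
    prec = tp / predicted if predicted else float("nan")
    return sens, prec


# Toy labels (made up): self-reported versus imputed category.
truth = ["White", "White", "Black", "Black", "Asian", "Latino", "Latino", "Other"]
pred = ["White", "White", "Black", "White", "Asian", "Latino", "White", "White"]

print(accuracy(truth, pred))
print(cohens_kappa(truth, pred))
print(sensitivity_precision(truth, pred, "White"))
```

In this toy example every misclassified individual is assigned to White, so sensitivity for White is perfect while its precision is low, which is the kind of differential pattern that overall accuracy alone would hide.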

It is also crucial to consider the ethical implications when deciding whether to impute demographic variables like race and ethnicity. Individuals may deliberately choose to withhold their race and ethnicity, and imputation procedures may perpetuate the underrepresentation of certain subgroups [37, 38]. If race and ethnicity are not MCAR, results from a complete case analysis could mischaracterize health disparities. Imputation may reduce biases in the results. However, imputation will produce some level of error or uncertainty, and differential misclassification of race and ethnicity could introduce new biases. This study contributes to the literature by comparing the accuracy of various imputation methods and can help inform the selection of an imputation approach. Importantly, imputation does not eliminate the need for improved data collection of demographic variables. Health systems should incorporate evidence‐based strategies, such as through self‐identification via patient portals or self‐check‐in kiosks, to improve the collection of race and ethnicity [39].

This study has several limitations. First, we performed data simulations to validate the imputation approaches, because we did not have a gold standard for those who were missing race and ethnicity in the EHR (e.g., race and ethnicity linked from another data source). Simulations were conducted among the subset of complete cases of the NYU cohort. This subset may have been systematically different from those with an unknown self‐reported race and ethnicity, which could have affected the performance of the imputation approaches. Additionally, all imputation approaches relied on MAR assumptions. All approaches also had poor performance for the classification of Other race and ethnicity. In this study, the “Other” category groups together those who are multiracial and those of other races (not Asian, Black, Latino, or White). This grouping continues the problematic erasure and obfuscation of data from American Indian/Alaskan Native populations [40]. Additionally, imputation approaches for this category may have poor performance because this category represents a diverse group of individuals. Disaggregating Other into more defined racial and ethnic categories would be preferable; however, this racial and ethnic hierarchy was selected to align with BISG predictions. Investigators should use more granular racial and ethnic categories when applying the anonymized imputation approaches to their research, provided sample sizes are sufficient. BISG predicted probabilities are intended to be used directly rather than converted into categorical classifications of race and ethnicity, because categorical classification can mask uncertainty and reduce the accuracy of this approach. However, there are use cases when categorical classifications of race and ethnicity are warranted, and the cutoffs used were based on prior research [28]. We also ran a sensitivity analysis using BISG prediction probabilities as sampling weights for the categorical classification of race and ethnicity. Results from this sensitivity analysis did not change the interpretation of the main findings of this paper. Finally, results may not be transportable to other locations or patient populations.
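One way to read the sampling‐weight approach mentioned above is to draw each individual's category at random, with the predicted probabilities serving as the draw weights. The sketch below follows that interpretation; it is an assumption about how such an analysis could be set up, and the function names and probability values are hypothetical.

```python
import random
from typing import Dict, List

CATEGORIES = ("Asian", "Black", "Latino", "Other", "White")


def sample_category(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one category, using the predicted probabilities as sampling weights."""
    weights = [probs.get(c, 0.0) for c in CATEGORIES]
    # random.choices raises ValueError if all weights are zero.
    return rng.choices(CATEGORIES, weights=weights, k=1)[0]


def sample_cohort(all_probs: List[Dict[str, float]], seed: int = 0) -> List[str]:
    """Assign a sampled category to every individual, reproducibly via a seed."""
    rng = random.Random(seed)
    return [sample_category(p, rng) for p in all_probs]


# Hypothetical predicted probabilities (made-up values).
ambiguous = {"Asian": 0.10, "Black": 0.35, "Latino": 0.10, "Other": 0.05, "White": 0.40}

draws = sample_cohort([ambiguous] * 10_000, seed=1)
share_white = draws.count("White") / len(draws)
print(round(share_white, 2))
```

Unlike a maximum‐probability rule, which would assign every such individual to White, sampling spreads assignments across categories in proportion to the predicted probabilities, so group‐level distributions retain the uncertainty of the predictions.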

Despite these limitations, applying these methods to both a raw EHR data source and an EHR‐based research database is a key strength of this study. Although both data sources have large sample sizes and diverse populations, they have vastly disparate data structures that could affect missing data mechanisms. For example, research databases integrate EHR data from institutions that may use different database structures or coding systems. This requires data to be translated or standardized into a common data model, which can result in data loss, for example through reduced granularity (e.g., de‐identification, only allowing for a single racial category) or through mismatches in categorizations (e.g., “Asian/Pacific Islander” in source data but either “Asian” or “Pacific Islander” in the common data model) [41]. This was exemplified in the substantially higher level of missingness in race and ethnicity within INSIGHT than NYU. Through our application of these methods to both data sources, we demonstrate that the methods produced comparable results with these different levels of missingness and potentially different missing data mechanisms.
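The categorization mismatch described above can be sketched as a value‐set mapping in which a combined source category has no single target. The mapping table and record values below are hypothetical and do not reproduce the value sets of any specific common data model.

```python
from typing import List

# Hypothetical mapping from source-system race values to a common data model
# that has separate "Asian" and "Pacific Islander" categories.
SOURCE_TO_CDM = {
    "Asian": "Asian",
    "Black or African American": "Black",
    "White": "White",
    "Native Hawaiian or Other Pacific Islander": "Pacific Islander",
    # "Asian/Pacific Islander" is deliberately absent: it could map to either
    # target category, so no one-to-one translation exists.
}


def to_cdm(source_value: str) -> str:
    """Translate a source value; values without a unique target become "Unknown"."""
    return SOURCE_TO_CDM.get(source_value, "Unknown")


def missing_fraction(source_values: List[str]) -> float:
    """Share of records whose race is unknown after standardization."""
    mapped = [to_cdm(v) for v in source_values]
    return mapped.count("Unknown") / len(mapped)


# Made-up source records, all of which have a recorded race in the source system.
source_records = [
    "Asian",
    "Asian/Pacific Islander",
    "White",
    "Asian/Pacific Islander",
    "Black or African American",
]

print(missing_fraction(source_records))
```

Here every source record has a recorded value, yet two of the five become unknown after standardization, illustrating how translation into a common data model can raise missingness independently of what patients reported.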

5. Conclusions

Our findings suggest that BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, which are limited by the strength of the association between these covariates and the true race and ethnicity. If researchers are limited to anonymized covariates, imputation using random forest models may provide more accurate classifications than MICE or single imputation with neighborhood majority. Racial or ethnic distributions and chronic disease burden were stable across imputation methods, suggesting descriptive studies of disease burden may not be sensitive to methods for imputing missing data. However, imputation may improve the precision of prevalence estimates compared to complete case analyses, particularly for smaller racial groups. Imputation will introduce some level of error and uncertainty in a study, and continued evidence‐based efforts are needed to improve the collection of race and ethnicity in EHR data.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1. hesr14649‐sup‐0001‐supinfo.

HESR-60-e14649-s001.docx (353.2 KB, docx)

Acknowledgments

This study was sponsored by a grant (no. R01 AG073321) from the National Institute on Aging, National Institutes of Health. Dr. Dodson is further supported by a midcareer mentoring award (no. K24AG080025) from the NIH/NIA.

Conderino S., Divers J., Dodson J. A., Thorpe L. E., Weiner M. G., and Adhikari S., “Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data,” Health Services Research 60, no. 5 (2025): e14649, 10.1111/1475-6773.14649.

Funding: This work was supported by the National Institute on Aging, National Institutes of Health (K24AG080025, R01 AG073321).

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

References

1. Brown D. P., Knapp C., Baker K., and Kaufmann M., “Using Bayesian Imputation to Assess Racial and Ethnic Disparities in Pediatric Performance Measures,” Health Services Research 51, no. 3 (2016): 1095–1108, 10.1111/1475-6773.12405.
2. Polubriaginof F. C. G., Ryan P., Salmasian H., et al., “Challenges With Quality of Race and Ethnicity Data in Observational Databases,” Journal of the American Medical Informatics Association 26, no. 8–9 (2019): 730–736, 10.1093/jamia/ocz113.
3. Baker D. W., Hasnain‐Wynia R., Kandula N. R., Thompson J. A., and Brown E. R., “Attitudes Toward Health Care Providers, Collecting Information About Patients' Race, Ethnicity, and Language,” Medical Care 45, no. 11 (2007): 1034–1042, 10.1097/MLR.0b013e318127148f.
4. Truman B. I., Smith K. C., Roy K., et al., “Rationale for Regular Reporting on Health Disparities and Inequalities—United States,” MMWR Surveillance Summaries 60, no. Suppl 01 (2011): 3–10.
5. Rubin D. B., “Inference and Missing Data,” Biometrika 63, no. 3 (1976): 581–592.
6. Filice C. E. and Joynt K. E., “Examining Race and Ethnicity Information in Medicare Administrative Data,” Medical Care 55, no. 12 (2017): e170–e176.
7. Sholle E. T., Pinheiro L. C., Adekkanattu P., et al., “Underserved Populations With Missing Race Ethnicity Data Differ Significantly From Those With Structured Race/Ethnicity Documentation,” Journal of the American Medical Informatics Association 26, no. 8–9 (2019): 722–729, 10.1093/jamia/ocz040.
8. Lee S. J. C., Grobe J. E., and Tiro J. A., “Assessing Race and Ethnicity Data Quality Across Cancer Registries and EMRs in Two Hospitals,” Journal of the American Medical Informatics Association 23, no. 3 (2015): 627–634, 10.1093/jamia/ocv156.
9. Zaslavsky A. M., Ayanian J. Z., and Zaborski L. B., “The Validity of Race and Ethnicity in Enrollment Data for Medicare Beneficiaries,” Health Services Research 47, no. 3pt2 (2012): 1300–1321, 10.1111/j.1475-6773.2012.01411.x.
10. Heymans M. W. and Twisk J. W. R., “Handling Missing Data in Clinical Research,” Journal of Clinical Epidemiology 151 (2022): 185–188, 10.1016/j.jclinepi.2022.08.016.
11. Elliott M. N., Fremont A., Morrison P. A., Pantoja P., and Lurie N., “A New Method for Estimating Race/Ethnicity and Associated Disparities Where Administrative Records Lack Self‐Reported Race/Ethnicity,” Health Services Research 43, no. 5 Pt 1 (2008): 1722–1736, 10.1111/j.1475-6773.2008.00854.x.
12. Elliott M. N., Morrison P. A., Fremont A., McCaffrey D. F., Pantoja P., and Lurie N., “Using the Census Bureau's Surname List to Improve Estimates of Race/Ethnicity and Associated Disparities,” Health Services and Outcomes Research Methodology 9, no. 2 (2009): 69–83, 10.1007/s10742-009-0047-1.
13. Derose S. F., Contreras R., Coleman K. J., Koebnick C., and Jacobsen S. J., “Race and Ethnicity Data Quality and Imputation Using U.S. Census Data in an Integrated Health System: The Kaiser Permanente Southern California Experience,” Medical Care Research and Review 70, no. 3 (2013): 330–345, 10.1177/1077558712466293.
14. Grundmeier R. W., Song L., Ramos M. J., et al., “Imputing Missing Race/Ethnicity in Pediatric Electronic Health Records: Reducing Bias With Use of U.S. Census Location and Surname Data,” Health Services Research 50, no. 4 (2015): 946–960, 10.1111/1475-6773.12295.
15. Pletcher M. J., Forrest C. B., and Carton T. W., “PCORnet's Collaborative Research Groups,” Patient Related Outcome Measures 9 (2018): 91–95, 10.2147/prom.S141630.
16. Biederman J., Fried R., DiSalvo M., et al., “Evidence of Low Adherence to Stimulant Medication Among Children and Youths With ADHD: An Electronic Health Records Study,” Psychiatric Services 70, no. 10 (2019): 874–880, 10.1176/appi.ps.201800515.
17. Eyllon M., Dang A. P., Barnes J. B., et al., “Associations Between Psychiatric Morbidity and COVID‐19 Vaccine Hesitancy: An Analysis of Electronic Health Records and Patient Survey,” Psychiatry Research 307 (2022): 114329, 10.1016/j.psychres.2021.114329.
18. Ma Y., Zhang W., Lyman S., and Huang Y., “The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data,” Health Services Research 53, no. 3 (2018): 1870–1889, 10.1111/1475-6773.12704.
19. Rubin D. B. and Schenker N., “Multiple Imputation in Health‐Care Databases: An Overview and Some Applications,” Statistics in Medicine 10, no. 4 (1991): 585–598, 10.1002/sim.4780100410.
20. Allen B., Basaraba C., Corbeil T., et al., “Racial Differences in COVID‐19 Severity Associated With History of Substance Use Disorders and Overdose: Findings From Multi‐Site Electronic Health Records in New York City,” Preventive Medicine 172 (2023): 107533, 10.1016/j.ypmed.2023.107533.
21. Kaushal R., Hripcsak G., Ascheim D. D., et al., “Changing the Research Landscape: The New York City Clinical Data Research Network,” Journal of the American Medical Informatics Association 21, no. 4 (2014): 587–590, 10.1136/amiajnl-2014-002764.
22. Lochner K. A. and Cox C. S., “Prevalence of Multiple Chronic Conditions Among Medicare Beneficiaries, United States, 2010,” Preventing Chronic Disease 10 (2013): E61, 10.5888/pcd10.120137.
23. Hernandez S. E., Sylling P. W., Mor M. K., et al., “Developing an Algorithm for Combining Race and Ethnicity Data Sources in the Veterans Health Administration,” Military Medicine 185, no. 3–4 (2019): e495, 10.1093/milmed/usz322.
24. Agency for Healthcare Research and Quality, “Social Determinants of Health Database,” (2025), https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html.
25. Shah A. D., Bartlett J. W., Carpenter J., Nicholas O., and Hemingway H., “Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study,” American Journal of Epidemiology 179, no. 6 (2014): 764–774, 10.1093/aje/kwt312.
26. Zhang Z., “Multiple Imputation With Multivariate Imputation by Chained Equation (MICE) Package,” Annals of Translational Medicine 4, no. 2 (2016): 30, 10.3978/j.issn.2305-5839.2015.12.63.
27. Imai K. and Khanna K., “Improving Ecological Inference by Predicting Individual Ethnicity From Voter Registration Records,” Political Analysis 24, no. 2 (2016): 263–272.
28. Adjaye‐Gbewonyo D., Bednarczyk R. A., Davis R. L., and Omer S. B., “Using the Bayesian Improved Surname Geocoding Method (BISG) to Create a Working Classification of Race and Ethnicity in a Diverse Managed Care Population: A Validation Study,” Health Services Research 49, no. 1 (2014): 268–283, 10.1111/1475-6773.12089.
29. Stekhoven D. J. and Bühlmann P., “MissForest—Non‐Parametric Missing Value Imputation for Mixed‐Type Data,” Bioinformatics 28, no. 1 (2011): 112–118, 10.1093/bioinformatics/btr597.
30. van Buuren S. and Groothuis‐Oudshoorn K., “Mice: Multivariate Imputation by Chained Equations in R,” Journal of Statistical Software 45, no. 3 (2011): 1–67, 10.18637/jss.v045.i03.
31. Rubin D. B., Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, 2004).
32. Cyr L. and Francis K., “Measures of Clinical Agreement for Nominal and Categorical Data: The Kappa Coefficient,” Computers in Biology and Medicine 22, no. 4 (1992): 239–246, 10.1016/0010-4825(92)90063-S.
33. Haas A., Elliott M. N., Dembosky J. W., et al., “Imputation of Race/Ethnicity to Enable Measurement of HEDIS Performance by Race/Ethnicity,” Health Services Research 54, no. 1 (2019): 13–23, 10.1111/1475-6773.13099.
34. Imai K., Olivella S., and Rosenman E. T. R., “Addressing Census Data Problems in Race Imputation via Fully Bayesian Improved Surname Geocoding and Name Supplements,” Science Advances 8, no. 49 (2022): eadc9824, 10.1126/sciadv.adc9824.
35. Yee K., Hoopes M., Giebultowicz S., Elliott M. N., and McConnell K. J., “Implications of Missingness in Self‐Reported Data for Estimating Racial and Ethnic Disparities in Medicaid Quality Measures,” Health Services Research 57, no. 6 (2022): 1370–1378, 10.1111/1475-6773.14025.
36. Chin M. K., Đoàn L. N., Russo R. G., et al., “Methods for Retrospectively Improving Race/Ethnicity Data Quality: A Scoping Review,” Epidemiologic Reviews 45, no. 1 (2023): 127–139, 10.1093/epirev/mxad002.
37. Lines L. M., Humphrey J. L., and Barch D. H., “Imputing Race and Ethnicity: A Fresh Voices Commentary From the Medical Care Blog,” Medical Care 60, no. 5 (2022): 351–356, 10.1097/mlr.0000000000001717.
38. Lockhart J. W., King M. M., and Munsch C., “Name‐Based Demographic Inference and the Unequal Distribution of Misrecognition,” Nature Human Behaviour 7, no. 7 (2023): 1084–1095, 10.1038/s41562-023-01587-9.
39. Weathers A. L., Garg N., Lundgren K. B., Benish S. M., Baca C. B., and Benson R. T., “Improved Accuracy/Completeness of EHR Race/Ethnicity Data,” Neurology Clinical Practice 14, no. 3 (2024): e200313, 10.1212/CPJ.0000000000200313.
40. Cook L., Espinoza J., Weiskopf N. G., et al., “Issues With Variability in Electronic Health Record Data About Race and Ethnicity: Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave,” JMIR Medical Informatics 10, no. 9 (2022): e39235, 10.2196/39235.
41. Garza M., Del Fiol G., Tenenbaum J., Walden A., and Zozus M. N., “Evaluating Common Data Models for Use With a Longitudinal Community Registry,” Journal of Biomedical Informatics 64 (2016): 333–341, 10.1016/j.jbi.2016.10.016.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data S1. hesr14649‐sup‐0001‐supinfo.

HESR-60-e14649-s001.docx^{(353.2KB, docx)}

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

[hesr14649-bib-0001] 1. Brown D. P., Knapp C., Baker K., and Kaufmann M., “Using Bayesian Imputation to Assess Racial and Ethnic Disparities in Pediatric Performance Measures,” Health Services Research 51, no. 3 (2016): 1095–1108, 10.1111/1475-6773.12405. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0002] 2. Polubriaginof F. C. G., Ryan P., Salmasian H., et al., “Challenges With Quality of Race and Ethnicity Data in Observational Databases,” Journal of the American Medical Informatics Association 26, no. 8–9 (2019): 730–736, 10.1093/jamia/ocz113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0003] 3. Baker D. W., Hasnain‐Wynia R., Kandula N. R., Thompson J. A., and Brown E. R., “Attitudes Toward Health Care Providers, Collecting Information About Patients' Race, Ethnicity, and Language,” Medical Care 45, no. 11 (2007): 1034–1042, 10.1097/MLR.0b013e318127148f. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0004] 4. Truman B. I., Smith K. C., Roy K., et al., “Rationale for Regular Reporting on Health Disparities and Inequalities—United States,” MMWR Surveillance Summaries 60, no. Suppl 01 (2011): 3–10. [PubMed] [Google Scholar]

[hesr14649-bib-0005] 5. Rubin D. B., “Inference and Missing Data,” Biometrika 63, no. 3 (1976): 581–592. [Google Scholar]

[hesr14649-bib-0006] 6. Filice C. E. and Joynt K. E., “Examining Race and Ethnicity Information in Medicare Administrative Data,” Medical Care 55, no. 12 (2017): e170–e176. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0007] 7. Sholle E. T., Pinheiro L. C., Adekkanattu P., et al., “Underserved Populations With Missing Race Ethnicity Data Differ Significantly From Those With Structured Race/Ethnicity Documentation,” Journal of the American Medical Informatics Association 26, no. 8–9 (2019): 722–729, 10.1093/jamia/ocz040. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0008] 8. Lee S. J. C., Grobe J. E., and Tiro J. A., “Assessing Race and Ethnicity Data Quality Across Cancer Registries and EMRs in Two Hospitals,” Journal of the American Medical Informatics Association 23, no. 3 (2015): 627–634, 10.1093/jamia/ocv156. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0009] 9. Zaslavsky A. M., Ayanian J. Z., and Zaborski L. B., “The Validity of Race and Ethnicity in Enrollment Data for Medicare Beneficiaries,” Health Services Research 47, no. 3pt2 (2012): 1300–1321, 10.1111/j.1475-6773.2012.01411.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0010] 10. Heymans M. W. and Twisk J. W. R., “Handling Missing Data in Clinical Research,” Journal of Clinical Epidemiology 151 (2022): 185–188, 10.1016/j.jclinepi.2022.08.016. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0011] 11. Elliott M. N., Fremont A., Morrison P. A., Pantoja P., and Lurie N., “A New Method for Estimating Race/Ethnicity and Associated Disparities Where Administrative Records Lack Self‐Reported Race/Ethnicity,” Health Services Research 43, no. 5 Pt 1 (2008): 1722–1736, 10.1111/j.1475-6773.2008.00854.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0012] 12. Elliott M. N., Morrison P. A., Fremont A., McCaffrey D. F., Pantoja P., and Lurie N., “Using the Census Bureau's Surname List to Improve Estimates of Race/Ethnicity and Associated Disparities,” Health Services and Outcomes Research Methodology 9, no. 2 (2009): 69–83, 10.1007/s10742-009-0047-1. [DOI] [Google Scholar]

[hesr14649-bib-0013] 13. Derose S. F., Contreras R., Coleman K. J., Koebnick C., and Jacobsen S. J., “Race and Ethnicity Data Quality and Imputation Using U.S. Census Data in an Integrated Health System:The Kaiser Permanente Southern California Experience,” Medical Care Research and Review 70, no. 3 (2013): 330–345, 10.1177/1077558712466293. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0014] 14. Grundmeier R. W., Song L., Ramos M. J., et al., “Imputing Missing Race/Ethnicity in Pediatric Electronic Health Records: Reducing Bias With Use of U.S. Census Location and Surname Data,” Health Services Research 50, no. 4 (2015): 946–960, 10.1111/1475-6773.12295. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0015] 15. Pletcher M. J., Forrest C. B., and Carton T. W., “PCORnet's Collaborative Research Groups,” Patient Related Outcome Measures 9 (2018): 91–95, 10.2147/prom.S141630. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0016] 16. Biederman J., Fried R., DiSalvo M., et al., “Evidence of Low Adherence to Stimulant Medication Among Children and Youths With ADHD: An Electronic Health Records Study,” Psychiatric Services 70, no. 10 (2019): 874–880, 10.1176/appi.ps.201800515. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0017] 17. Eyllon M., Dang A. P., Barnes J. B., et al., “Associations Between Psychiatric Morbidity and COVID‐19 Vaccine Hesitancy: An Analysis of Electronic Health Records and Patient Survey,” Psychiatry Research 307 (2022): 114329, 10.1016/j.psychres.2021.114329. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0018] 18. Ma Y., Zhang W., Lyman S., and Huang Y., “The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data,” Health Services Research 53, no. 3 (2018): 1870–1889, 10.1111/1475-6773.12704. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0019] 19. Rubin D. B. and Schenker N., “Multiple Imputation in Health‐Are Databases: An Overview and Some Applications,” Statistics in Medicine 10, no. 4 (1991): 585–598, 10.1002/sim.4780100410. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0020] 20. Allen B., Basaraba C., Corbeil T., et al., “Racial Differences in COVID‐19 Severity Associated With History of Substance Use Disorders and Overdose: Findings From Multi‐Site Electronic Health Records in New York City,” Preventive Medicine 172 (2023): 107533, 10.1016/j.ypmed.2023.107533. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0021] 21. Kaushal R., Hripcsak G., Ascheim D. D., et al., “Changing the Research Landscape: The New York City Clinical Data Research Network,” Journal of the American Medical Informatics Association 21, no. 4 (2014): 587–590, 10.1136/amiajnl-2014-002764. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0022] 22. Lochner K. A. and Cox C. S., “Prevalence of Multiple Chronic Conditions Among Medicare Beneficiaries, United States, 2010,” Preventing Chronic Disease 10 (2013): E61, 10.5888/pcd10.120137. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0023] 23. Hernandez S. E., Sylling P. W., Mor M. K., et al., “Developing an Algorithm for Combining Race and Ethnicity Data Sources in the Veterans Health Administration,” Military Medicine 185, no. 3–4 (2019): e495, 10.1093/milmed/usz322. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0024] 24. “Agency for Healthcare Research and Quality,” (2025), Social Determinants of Health Database, https://www.ahrq.gov/sdoh/data‐analytics/sdoh‐data.html.

[hesr14649-bib-0025] 25. Shah A. D., Bartlett J. W., Carpenter J., Nicholas O., and Hemingway H., “Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study,” American Journal of Epidemiology 179, no. 6 (2014): 764–774, 10.1093/aje/kwt312. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0026] 26. Zhang Z., “Multiple Imputation With Multivariate Imputation by Chained Equation (MICE) Package,” Annals of Translational Medicine 4, no. 2 (2016): 30, 10.3978/j.issn.2305-5839.2015.12.63. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0027] 27. Imai K. and Khanna K., “Improving Ecological Inference by Predicting Individual Ethnicity From Voter Registration Records,” Political Analysis 24, no. 2 (2016): 263–272. [Google Scholar]

[hesr14649-bib-0028] 28. Adjaye‐Gbewonyo D., Bednarczyk R. A., Davis R. L., and Omer S. B., “Using the Bayesian Improved Surname Geocoding Method (BISG) to Create a Working Classification of Race and Ethnicity in a Diverse Managed Care Population: A Validation Study,” Health Services Research 49, no. 1 (2014): 268–283, 10.1111/1475-6773.12089. [DOI] [PMC free article] [PubMed] [Google Scholar]

[hesr14649-bib-0029] 29. Stekhoven D. J. and Bühlmann P., “MissForest—Non‐Parametric Missing Value Imputation for Mixed‐Type Data,” Bioinformatics 28, no. 1 (2011): 112–118, 10.1093/bioinformatics/btr597. [DOI] [PubMed] [Google Scholar]

[hesr14649-bib-0030] 30. van Buuren S. and Groothuis‐Oudshoorn K., “Mice: Multivariate Imputation by Chained Equations in R,” Journal of Statistical Software 45, no. 3 (2011): 1–67, 10.18637/jss.v045.i03. [DOI] [Google Scholar]

[hesr14649-bib-0031] 31. Rubin D. B., Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, 2004). [Google Scholar]

32. Cyr L. and Francis K., “Measures of Clinical Agreement for Nominal and Categorical Data: The Kappa Coefficient,” Computers in Biology and Medicine 22, no. 4 (1992): 239–246, 10.1016/0010-4825(92)90063-S.

33. Haas A., Elliott M. N., Dembosky J. W., et al., “Imputation of Race/Ethnicity to Enable Measurement of HEDIS Performance by Race/Ethnicity,” Health Services Research 54, no. 1 (2019): 13–23, 10.1111/1475-6773.13099.

34. Imai K., Olivella S., and Rosenman E. T. R., “Addressing Census Data Problems in Race Imputation via Fully Bayesian Improved Surname Geocoding and Name Supplements,” Science Advances 8, no. 49 (2022): eadc9824, 10.1126/sciadv.adc9824.

35. Yee K., Hoopes M., Giebultowicz S., Elliott M. N., and McConnell K. J., “Implications of Missingness in Self‐Reported Data for Estimating Racial and Ethnic Disparities in Medicaid Quality Measures,” Health Services Research 57, no. 6 (2022): 1370–1378, 10.1111/1475-6773.14025.

36. Chin M. K., Đoàn L. N., Russo R. G., et al., “Methods for Retrospectively Improving Race/Ethnicity Data Quality: A Scoping Review,” Epidemiologic Reviews 45, no. 1 (2023): 127–139, 10.1093/epirev/mxad002.

37. Lines L. M., Humphrey J. L., and Barch D. H., “Imputing Race and Ethnicity: A Fresh Voices Commentary From the Medical Care Blog,” Medical Care 60, no. 5 (2022): 351–356, 10.1097/mlr.0000000000001717.

38. Lockhart J. W., King M. M., and Munsch C., “Name‐Based Demographic Inference and the Unequal Distribution of Misrecognition,” Nature Human Behaviour 7, no. 7 (2023): 1084–1095, 10.1038/s41562-023-01587-9.

39. Weathers A. L., Garg N., Lundgren K. B., Benish S. M., Baca C. B., and Benson R. T., “Improved Accuracy/Completeness of EHR Race/Ethnicity Data,” Neurology Clinical Practice 14, no. 3 (2024): e200313, 10.1212/CPJ.0000000000200313.

40. Cook L., Espinoza J., Weiskopf N. G., et al., “Issues With Variability in Electronic Health Record Data About Race and Ethnicity: Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave,” JMIR Medical Informatics 10, no. 9 (2022): e39235, 10.2196/39235.

41. Garza M., Del Fiol G., Tenenbaum J., Walden A., and Zozus M. N., “Evaluating Common Data Models for Use With a Longitudinal Community Registry,” Journal of Biomedical Informatics 64 (2016): 333–341, 10.1016/j.jbi.2016.10.016.
