Abstract
The use of data derived from electronic health records (EHRs) to describe racial and ethnic health disparities is increasingly common, but there are challenges. While the number of patients contained in EHRs can be quite large, they may not be representative of a source population. One way to evaluate the extent of this limitation is by linking EHRs to an external source, in this case with the American Community Survey (ACS). Relying on a stratified random sample of about 200,000 patient records from a large, public, integrated delivery system in North Carolina (2016-2019), we assess linkages to restricted ACS microdata (2001-2017) based on race and ethnicity to understand the strengths and weaknesses of EHR-derived data for describing disparities. Our results suggest that Black-White comparisons will benefit from standard adjustments (e.g., weighting procedures), but that misestimation of health disparities may arise for Hispanic patients due to coverage rates for this group.
Keywords: race, ethnicity, health, electronic health records
Introduction
As survey response rates fall and costs of data collection rise, researchers increasingly look to administrative and clinical data as supplements or substitutes (Meyer et al., 2015). For those interested in health disparities, data derived from electronic health records (EHRs) are one such source. Although researchers do not have direct access to EHRs, for simplicity, we refer to EHR-derived data as EHRs throughout. EHRs are increasingly being used to describe racial and ethnic disparities in health care utilization, treatment, and outcomes. Recent examples include studies of childhood obesity (Sharifi et al., 2016), maternal and neonatal delivery complications (Huennekens et al., 2020), and the receipt of treatment medications among adults seeking care and with a positive SARS-CoV-2 test (Boehmer et al., 2022; Wiltz et al., 2022). However, there are concerns related to the representation of racial and ethnic groups in EHRs, which has implications for studying racial and ethnic disparities in population research. This research note considers lessons learned about the strengths and weaknesses of EHRs for health disparities research, which have long been of interest to demographers (Foster et al., 2024; Montez et al., 2019). By linking EHRs from a large integrated health delivery system in North Carolina to microdata from the American Community Survey (ACS), our findings demonstrate the potential for the use of EHRs for population research.
Background
EHRs have tremendous potential for increasing knowledge and understanding of racial and ethnic health disparities. They contain up-to-date and detailed information pertinent to people’s health (e.g., utilization, laboratory tests and results, procedures, problem lists, visit-specific diagnoses, medications prescribed) for very large patient populations (Casey et al., 2016) and have benefits over other health data sources. Relative to health care administrative data (e.g., insurance claims data), which provide billing diagnoses and types of care regardless of where care is received, EHRs contain results of tests and biologic data such as weight or blood pressure. Compared to vital statistics, which describe health disparities at the beginning and end of life, EHRs cover the years in between. Finally, in contrast to most surveys which rely on subjective reports and are subject to potential recall bias about test results and diagnoses, EHRs contain those test results, whether they exceed clinical thresholds, and diagnoses based on them.
EHRs are a promising source to document racial and ethnic disparities in health and health care and can inform targeted interventions to ameliorate them (Rumball-Smith & Bates, 2018), but there are challenges: the data are not representative of a source population. Information based on EHRs pertains to people who obtain health care from a particular provider (or group of providers). Numbers can be quite large, but as a nonprobability sample of a source population, they are not population representative (Friedman et al., 2013; Goldstein et al., 2016; Groves, 2006). Estimation of health disparities can be biased if racial and ethnic groups are differentially represented. Representativeness may be growing since health facilities are increasingly aggregated into large integrated delivery systems, caring for up to several million people each. Nevertheless, the potential for bias remains.
As a further complication, EHRs are selective of people who choose to seek health care, which may vary between groups depending on insurance coverage and other factors (Goldstein et al., 2016; Weiskopf & Weng, 2013). Numerators and denominators may be affected differently as those who do not need or want health care or who lack insurance may have reduced representation. Denominators may exclude some of those at risk of the disease or health condition recorded in the numerator, potentially overstating prevalence as a result. Analysts typically correct for this by weighting prevalence measures or incorporating inverse probability weights in regression analyses. Bias is lessened but not eliminated when EHRs incorporate a full range of care, from preventive to acute (Bower et al., 2017; Klompas et al., 2017). Other factors affecting health care utilization are also relevant to inclusion in EHRs, including education, health insurance coverage, and access to transportation (Bower et al., 2017). Racial and ethnic differences in access to health care have narrowed over the past two decades, especially for the Hispanic population, but nevertheless persist (Ma et al., 2022; Mahajan et al., 2021).
With full information on the source population, it would be straightforward to identify potential bias and adjust for it. Instead, we take an indirect approach, assessing strengths and weaknesses of EHRs for the study of racial and ethnic health disparities through comparison with the ACS. The ACS is an annual cross-sectional survey of about 1-1.5% of the U.S. population designed as a replacement for the decennial census long-form in 2001. Critically, it is based on a probability sample designed to be representative. Although the sampling fraction is small, the number of individuals included each year is very large, more than 5 million nationally and exceeding 100,000 in North Carolina, the state served by the integrated health system we study.1 Participation in the ACS is mandatory and response rates have historically topped 90% (U.S. Census Bureau, 2022b), although coverage rates vary by race and ethnicity (U.S. Census Bureau, 2022a), a point to which we will return below.
Data and Methods
This research note leveraged data collected as part of a pilot study to assess the feasibility of integrating the detailed health information available in EHRs with social, economic, and demographic data from the ACS (Udalova et al., 2022). It focused on EHRs from a large integrated health delivery system in North Carolina from 2016-2019. The health system is composed of a large academic health center, 11 community hospitals, and many hundreds of practices across the state including both urban and rural areas. Demographic and clinical data are collected using a single enterprise-level EHR with consistent policies regarding registration and demographic and clinical data collection across hospitals and practices. Care is offered to all residents of the state regardless of ability to pay, including uninsured, resulting in a diverse patient population, including citizens and non-citizens. Over two million patients were seen during 2016-2019, prior to Medicaid expansion in the state.
In the integrated health delivery system, race and ethnicity are generally obtained from a patient or proxy as part of registration during the first visit. During 2016-2019, patients provided the information directly to a clerk or as part of an intake questionnaire. Answers were coded as follows: for race, American Indian or Alaskan Native (AIAN), Asian, Black or African American, Native Hawaiian or Other Pacific Islander (NHPI), Other Race, Patient Refused, Unknown, White or Caucasian (categories listed in alphabetical order on the screen); and for ethnicity, Hispanic or Latino, not Hispanic or Latino, Patient Refused, Unknown. Clerks were instructed to accept whatever answer they were given and specifically not to push if there was resistance and not to fill in based on their own observations. For about 10.5 percent of patients, race was missing or the patient refused to answer the question (Table 1). This figure compares favorably to averages based on the 56 health care institutions in the National COVID Collaborative Cohort (N3C), in which 11.3% of patients were missing data on race and an additional 8% were recorded as refusals (Cook et al., 2022). Although these patients are included in our analyses, a detailed assessment of missing data is beyond the scope of this research note. We return to this in our discussion.
Table 1.
Descriptive statistics for EHR sample (weighted)
| | Percent | Standard error |
|---|---|---|
| Total | 100 | |
| Assigned a PIK | 97.75 | 0.0276 |
| Matched to ACS (Conditional on Receiving a PIK) | 16.87 | 0.1367 |
| Sex | ||
| Male | 41.62 | 0.1730 |
| Female | 58.38 | 0.1730 |
| Birth Year | ||
| Before 1944 | 4.979 | 0.0815 |
| 1945-1949 | 10.09 | 0.1111 |
| 1950-1954 | 11.35 | 0.1152 |
| 1955-1959 | 11.72 | 0.1149 |
| 1960-1964 | 11.70 | 0.1131 |
| 1965-1969 | 10.91 | 0.1080 |
| 1970-1974 | 10.25 | 0.1034 |
| 1975-1979 | 9.200 | 0.0968 |
| 1980-1984 | 9.018 | 0.0957 |
| 1985 and later | 10.77 | 0.1036 |
| Race | ||
| White | 61.71 | 0.1464 |
| Black | 19.05 | 0.1064 |
| AIAN | 0.473 | 0.0085 |
| Asian | 1.843 | 0.0154 |
| Other | 6.373 | 0.0376 |
| Missing/Refused | 10.54 | 0.0565 |
| Ethnicity | ||
| Hispanic or Latino | 5.584 | 0.0350 |
| Not Hispanic or Latino | 84.18 | 0.0737 |
| Other/Missing | 10.23 | 0.0572 |
| Language | ||
| English | 93.12 | 0.0491 |
| Spanish | 3.356 | 0.0265 |
| Unknown | 2.814 | 0.0357 |
| Other | 0.7075 | 0.0164 |
| Health Insurance* | ||
| Any Private (Employer, State, Tricare) | 59.68 | 0.1703 |
| Any Medicare | 20.32 | 0.1461 |
| Any Medicaid/Gov’t Assistance | 10.55 | 0.1004 |
| Any Uninsured/Self-Pay | 29.60 | 0.1547 |
| Residence** | ||
| In-state | 67.30 | 0.1671 |
| Out-of-state | 2.501 | 0.0558 |
| Missing | 30.20 | 0.1639 |
Source: Electronic Health Records (2016-2019)
Note: * = insurance categories are not mutually exclusive; ** = residence is based on address provided at most recent visit.
AIAN = American Indian and Alaska Native. DMS #P-7519212.
All results were approved for release by the Disclosure Review Board of the U.S. Census Bureau, authorization number CBDRB-FY23-SEHSD003-021 and CBDRB-FY23-POP001-0074. All numeric values were rounded according to U.S. Census Bureau disclosure protocols to preserve data privacy.
We drew a disproportionate stratified random sample of about 200,000 patients aged 25-74 with at least two visits (e.g., hospitalization, in-patient visit, or out-patient visit) between 2016 and 2019 (Udalova et al., 2022). We oversampled patients who identified as Black, Asian, and Hispanic as well as those for whom race and ethnicity were missing to obtain more precise estimates for these groups. Table 1 shows descriptive statistics for the sample drawn, weighted to account for the stratified sample design. Weights were constructed based on the racial and ethnic composition of the original EHR population. Patients in the weighted sample predominantly identify as White (61.71%) or Black (19.05%); smaller fractions identify as AIAN, Asian, or Hispanic (0.47%, 1.84%, and 5.58%, respectively) (Table 1). For this analysis, patients identifying as NHPI were grouped with Other Race.
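The weighting step can be illustrated with a minimal sketch. It assumes the simplest possible design, in which each stratum's weight is the number of patients in the source EHR population divided by the number sampled; the stratum names and counts are hypothetical and are not the study's figures.

```python
# Hypothetical stratum counts: (patients in full EHR population, patients sampled).
# These numbers are illustrative only and do not come from the study.
strata = {
    "White": (1_200_000, 55_000),
    "Black": (400_000, 33_000),
    "Hispanic": (110_000, 33_000),
}

def design_weights(strata):
    """Return one weight per stratum: population count / sample count."""
    return {name: pop / n for name, (pop, n) in strata.items()}

def weighted_shares(strata):
    """Weighted percent of the sample falling in each stratum.

    Every sampled patient carries the stratum weight, so the weighted share
    recovers the stratum's share of the source population.
    """
    weights = design_weights(strata)
    totals = {name: weights[name] * n for name, (_, n) in strata.items()}
    grand_total = sum(totals.values())
    return {name: 100 * t / grand_total for name, t in totals.items()}

weights = design_weights(strata)
shares = weighted_shares(strata)
```

Oversampled strata receive smaller weights, which is why weighted percentages such as those in Table 1 can differ sharply from unweighted sample counts.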
Structured EHRs were transferred via secure means to the Census Bureau IT environment and a Protected Identification Key (PIK) was assigned.2 A PIK is a unique anonymized person identifier used at the Census Bureau to link surveys and administrative records. Personally Identifiable Information (PII) from EHRs is passed through successive modules of the Person Identification Validation System (PVS), comparing Social Security number (SSN), address, name, sex, and full date of birth to a reference file maintained at the Census Bureau (Wagner & Layne, 2014). When a linkage could be made between the incoming record and the reference file, a PIK was appended to the patient record. The match to SSN is exact; matches involving the other identifiers are probabilistic (for additional information regarding PIK rates, see Udalova et al., 2022). Once a PIK was assigned, SSN and name were dropped, leaving information on sex, birth year, race, ethnicity, language, health insurance, and broad categories of residence for analysis. No health information was included in the transfer.
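To make the logic of a two-pass linkage concrete, the following toy sketch tries an exact SSN match first and then falls back to agreement on other identifiers. It is not the Census Bureau's PVS: the function names, record layout, and the deterministic fallback (standing in for the probabilistic modules) are all assumptions made for illustration, and the records are invented.

```python
def normalize(text):
    """Lowercase and trim so trivial formatting differences do not block a match."""
    return text.strip().lower()

def assign_pik(record, reference):
    """Toy two-pass linkage; NOT the Census Bureau's PVS.

    Pass 1: exact match on SSN when the incoming record has one.
    Pass 2: fall back to agreement on name, date of birth, and sex
            (a deterministic stand-in for probabilistic matching).
    Returns the reference PIK, or None when no link can be made.
    """
    ssn = record.get("ssn")
    if ssn:
        for ref in reference:
            if ref["ssn"] == ssn:
                return ref["pik"]
    for ref in reference:
        same_name = normalize(ref["name"]) == normalize(record["name"])
        same_dob = ref["dob"] == record["dob"]
        same_sex = ref["sex"] == record["sex"]
        if same_name and same_dob and same_sex:
            return ref["pik"]
    return None

# Invented reference file and incoming records.
reference = [
    {"pik": "P001", "ssn": "111-11-1111", "name": "Ana Diaz", "dob": "1980-02-03", "sex": "F"},
    {"pik": "P002", "ssn": "222-22-2222", "name": "Sam Lee", "dob": "1975-07-19", "sex": "M"},
]
with_ssn = {"ssn": "222-22-2222", "name": "S. Lee", "dob": "1975-07-19", "sex": "M"}
without_ssn = {"name": " ana diaz ", "dob": "1980-02-03", "sex": "F"}
not_in_reference = {"name": "Lu Chen", "dob": "1990-01-01", "sex": "F"}

piks = [assign_pik(r, reference) for r in (with_ssn, without_ssn, not_in_reference)]
```

The third record illustrates the case discussed in the text: a person absent from the reference file cannot receive a PIK no matter how clean the incoming identifiers are.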
Our first interest was in whether the success of PIK assignment depended on race and ethnicity. There are two reasons why it might (Bond et al., 2014). First, some patients may not be included in the federal and state data sources on which the reference file is based, e.g., recent immigrants. In North Carolina, the Hispanic population has increased rapidly, accounting for almost 10 percent of the overall population in 2019, the end of our window (U.S. Census Bureau, 2019a). Immigration played a major role in this change: about 40 percent of the Hispanic population was foreign-born in 2019 (U.S. Census Bureau, 2019b). A substantial portion of all immigrants in North Carolina were undocumented in 2016 (39%), most of whom (56%) were from Mexico (Pew Research Center, 2019). Undocumented immigrants may purposely avoid being included in government data. Second, patients may be included in federal data sources but discrepancies between these sources and EHRs in the way information is recorded may undermine the match.
Once a PIK was assigned to an EHR, we attempted to match it to ACS data between 2001 and 2017. There are three reasons why we might fail to make a match. First, ACS data pertain to a sample of the state population, about 1-1.5% per year. An EHR that received a PIK might not match an ACS record because that person was not part of the ACS sample in any of the study years. Without knowing the population of eligible potential matches, it is not possible to say what the match rate should be, but it likely lies between 12 and 18 percent (Udalova et al., 2022). Match rates will likely be lower for recently arrived immigrants as they are less likely to have been included in the early years of the ACS. Second, there are longstanding differences in the coverage of racial and ethnic groups in the ACS. The Census Bureau measures coverage rates as the ratio of the ACS population of a group to an independent estimate for that group, times 100 (U.S. Census Bureau, 2022d). In 2017, the end of our observation window for the study’s use of ACS data, national coverage rates were 94.9 percent for White Non-Hispanic, 82.5 percent for Black Non-Hispanic, and 86.9 percent for Hispanic individuals (U.S. Census Bureau, 2022a). Although aggregate ACS population estimates are adjusted for under-coverage, some individuals who should have been included will be missing and not available for a match. Third and finally, only ACS records that have received a PIK can be matched to EHRs with PIKs. The challenges with PIK assignments noted for EHRs also apply to PIK assignments for the ACS (Bond et al., 2014).
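The coverage rate described above is a simple ratio. The sketch below restates it as a function; the function name and the input counts are hypothetical, with the counts chosen only so that the result reproduces the 86.9 reported for Hispanic individuals.

```python
def coverage_rate(acs_population, independent_estimate):
    """Coverage rate: ACS population of a group over an independent estimate, times 100."""
    if independent_estimate <= 0:
        raise ValueError("independent_estimate must be positive")
    return 100 * acs_population / independent_estimate

# Hypothetical counts, not published figures.
rate = coverage_rate(869_000, 1_000_000)
```

A rate below 100 means that some members of the group are absent from the ACS and therefore cannot be matched, whatever the quality of the EHR record.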
Results
Table 2 shows unadjusted PIK and ACS match rates for racial and ethnic groups as well as for other social attributes available in the EHRs. Table 3 presents the unadjusted PIK and ACS match rates for racial and ethnic groups from Table 2 alongside PIK and ACS match rates adjusted for sex, birth cohort, language, health insurance, and residence. The adjusted rates are predicted probabilities from logistic regression models for each outcome, weighted to account for sample design. Of interest is the degree to which disparities narrow when we account for the (limited) information in the EHRs relevant to access and utilization. Full results are shown in Appendix Table A.1, which reports regression coefficients from linear probability models (LPM) and average marginal effects (AME) for logistic models. To aid in interpretation, Table 3 also includes information about national ACS PIK and coverage rates available from published sources. While we focus our interpretation of results on White, Black, and Hispanic patients, all groups are included in the tables.
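For readers less familiar with average marginal effects, the following sketch shows one common way an AME for a 0/1 indicator can be computed from fitted logit coefficients: predict each observation twice, with the indicator switched on and off, and average the difference. The coefficients, covariate names, and data rows are hypothetical, and the sketch omits the sampling weights used in the actual models.

```python
import math

def logistic(z):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def average_marginal_effect(rows, coefs, intercept, indicator):
    """AME of a 0/1 indicator in a logit model (unweighted).

    For every row, predict with the indicator set to 1 and to 0 while the
    row's other covariates stay at their observed values, then average
    the differences in predicted probability.
    """
    diffs = []
    for row in rows:
        on = dict(row, **{indicator: 1})
        off = dict(row, **{indicator: 0})
        p_on = logistic(intercept + sum(coefs[k] * on[k] for k in coefs))
        p_off = logistic(intercept + sum(coefs[k] * off[k] for k in coefs))
        diffs.append(p_on - p_off)
    return sum(diffs) / len(diffs)

# Hypothetical coefficients and data, for illustration only.
coefs = {"hispanic": -0.9, "uninsured": -1.2}
intercept = 4.0
rows = [
    {"hispanic": 1, "uninsured": 1},
    {"hispanic": 0, "uninsured": 0},
    {"hispanic": 0, "uninsured": 1},
    {"hispanic": 1, "uninsured": 0},
]
ame_hispanic = average_marginal_effect(rows, coefs, intercept, "hispanic")
```

Because an AME is expressed on the probability scale, it can be read alongside a linear probability model coefficient, which is the comparison Table A.1 invites.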
Table 2.
Number and percent of patients receiving a PIK or matched to ACS (conditional on receiving a PIK) across EHR demographic characteristics (unweighted)
| | Assigned a PIK: N | Assigned a PIK: Row % | Matched to ACS, Conditional on Receiving a PIK: N | Matched to ACS, Conditional on Receiving a PIK: Row % |
|---|---|---|---|---|
| Total | 187,000 | 93.97 | 29,000 | 15.51 |
| Race | ||||
| White | 55,000 | 98.65 | 9,800 | 17.82 |
| Black | 32,500 | 99.54 | 4,700 | 14.46 |
| AIAN | 3,200 | 98.46 | 600 | 18.75 |
| Asian | 16,500 | 96.77 | 2,100 | 12.73 |
| Other | 30,500 | 77.61 | 3,400 | 11.15 |
| Missing/Refused | 49,000 | 96.84 | 8,200 | 16.73 |
| Ethnicity | ||||
| Hispanic or Latino | 23,500 | 71.65 | 2,400 | 10.21 |
| Not Hispanic or Latino | 111,000 | 98.49 | 17,500 | 15.77 |
| Other/Missing | 52,000 | 98.30 | 9,000 | 17.31 |
| Sex | ||||
| Male | 76,000 | 94.53 | 11,500 | 15.13 |
| Female | 110,000 | 93.62 | 17,000 | 15.45 |
| Birth Year | ||||
| Before 1944 | 7,400 | 98.01 | 1,500 | 20.27 |
| 1945-1949 | 15,500 | 98.10 | 3,100 | 20.00 |
| 1950-1954 | 18,500 | 97.63 | 3,500 | 18.92 |
| 1955-1959 | 20,500 | 97.16 | 3,600 | 17.56 |
| 1960-1964 | 21,500 | 96.63 | 3,500 | 16.28 |
| 1965-1969 | 21,000 | 94.59 | 3,200 | 15.24 |
| 1970-1974 | 20,500 | 91.93 | 2,900 | 14.15 |
| 1975-1979 | 19,500 | 89.45 | 2,500 | 12.82 |
| 1980-1984 | 19,000 | 89.20 | 2,300 | 12.11 |
| 1985 and later | 23,000 | 92.00 | 2,900 | 12.61 |
| Language | ||||
| English | 160,000 | 98.40 | 26,000 | 16.25 |
| Spanish | 11,000 | 56.41 | 850 | 7.73 |
| Unknown | 11,000 | 96.49 | 1,800 | 16.36 |
| Other | 4,200 | 91.30 | 350 | 8.33 |
| Health Insurance* | ||||
| Any Private (Employer, State, Tricare) | 115,000 | 98.63 | 18,500 | 16.09 |
| Any Medicare | 45,500 | 99.34 | 8,500 | 18.68 |
| Any Medicaid/Gov’t Assistance | 29,500 | 90.49 | 3,600 | 12.20 |
| Any Uninsured/Self-Pay | 58,500 | 85.40 | 7,700 | 13.16 |
| Residence | ||||
| In-state | 129,000 | 93.48 | 19,000 | 14.73 |
| Out-of-state | 4,500 | 95.74 | 750 | 16.67 |
| Missing | 53,500 | 95.20 | 9,400 | 17.57 |
Source: Electronic Health Records (2016-2019) and American Community Survey data (2001-2017)
Note: * = insurance categories are not mutually exclusive.
AIAN = American Indian and Alaska Native. DMS #P-7519212.
All results were approved for release by the Disclosure Review Board of the U.S. Census Bureau, authorization number CBDRB-FY21-POP001-0087. All numeric values were rounded according to U.S. Census Bureau disclosure protocols to preserve data privacy.
Table 3.
Comparison of PIK and ACS match rates
| | PIK rate, EHR (2016-2019): unadjusted and unweighted | PIK rate, EHR (2016-2019): adjusted and weighted | 95% CI | ACS (2001-2017) match rate, conditional on receiving a PIK: unadjusted and unweighted | ACS (2001-2017) match rate, conditional on receiving a PIK: adjusted and weighted | 95% CI | Unadjusted PIK rate, ACS 2010 (Bond et al. 2014) | ACS coverage rate, national, 2017 |
|---|---|---|---|---|---|---|---|---|
| Race | ||||||||
| White | 98.65 | 99.67 | (99.62, 99.72) | 17.82 | 17.39 | (17.00, 17.77) | 93.45 | 94.9 |
| Black | 99.54 | 99.69 | (99.64, 99.74) | 14.46 | 14.53 | (14.13, 14.92) | 91.4 | 82.5 |
| AIAN | 98.46 | 99.39 | (99.20, 99.58) | 18.75 | 18.49 | (17.12, 19.85) | 91.03 | 91.5 |
| Asian | 96.77 | 98.22 | (97.98, 98.46) | 12.73 | 14.03 | (13.42, 14.63) | 90.81 | 92.6 |
| Other | 77.61 | 98.88 | (98.75, 99.01) | 11.15 | 14.7 | (14.14, 15.25) | 84.88 | - |
| Missing/Refused* | 96.84 | 98.98 | (98.87, 99.10) | 16.73 | 16.71 | (16.26, 17.16) | - | - |
| Ethnicity | ||||||||
| Hispanic/Latino | 71.65 | 98.63 | (98.46, 98.81) | 10.21 | 14.56 | (13.85, 15.26) | 87.13 | 86.9 |
| Not Hispanic/Latino | 98.49 | 99.63 | (99.58, 99.67) | 15.77 | 16.59 | (16.30, 16.87) | 93.64 | - |
| Other/Missing | 98.30 | 99.55 | (99.49, 99.61) | 17.31 | 16.92 | (16.42, 17.43) | - | - |
Source: Electronic Health Records (2016-2019) and American Community Survey data (2001-2017)
Note: EHR sample is 199,000; 95% confidence intervals are in parentheses; The unadjusted PIK rates for 2010 ACS were reported in Bond et al. (2014). Coverage rates refer to combined race/ethnicity: non-Hispanic White, non-Hispanic Black, non-Hispanic AIAN, non-Hispanic Asian, and Hispanic; DMS #P-7519212. All results were approved for release by the Disclosure Review Board of the U.S. Census Bureau, authorization number CBDRB-FY21-POP001-0087 and CBDRB-FY23-POP001-0053. All numeric values were rounded according to U.S. Census Bureau disclosure protocols to preserve data privacy.
* = there are no missing data for ACS information because this information is imputed.
Beginning with unadjusted PIK rates, almost all patients who identified as White (98.65%) or Black (99.54%) were assigned PIKs (Table 3). These patients were well covered in administrative data systems, and the information provided to the health care system was of sufficiently high quality for PIK assignment. In contrast, only 77.61 percent of patients who identified their race as something “other” than the categories provided in the EHRs were assigned PIKs. With respect to ethnicity, 71.65 percent of patients who identified as Hispanic received PIKs, in contrast to a high degree of success for patients who did not identify as Hispanic (98.49%). Although these patterns were broadly similar to unadjusted PIK rates for the 2010 ACS (Bond et al., 2014), two differences stood out. First, PIK rates were higher for White and Black patients in the EHRs than for the population generally, probably because SSNs were available for many (although not all) of them. Second, PIK rates were lower for “other” race and Hispanic patients in our study, suggesting something distinctive about the population of patients in our EHRs relative to the national sample.
Next, we investigated the likelihood of receiving a PIK adjusting for age, sex, race, ethnicity, language, and insurance status as reported in the EHRs. Table 3 reports predicted probabilities based on logistic regression estimates, with all other variables held at their means. All adjusted PIK rates were quite high, with differences substantially narrowed relative to the unadjusted rates. Adjusted PIK rates for “other” race and for Hispanic patients were within 2 percentage points of the other racial and ethnic categories, indicating the importance of language, health insurance, and, to a lesser extent, age and residence. These results suggest that a substantial portion of the difference in the unadjusted PIK rates reflects differences in health care access among racial and ethnic subpopulations.
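A predicted probability with other variables held at their means can be sketched as follows. The coefficients, intercept, and covariate means are hypothetical values chosen for illustration; they are not estimates from Table A.1, and the helper name is an assumption.

```python
import math

def predicted_probability_at_means(coefs, intercept, means, group):
    """Predicted probability for `group`, all other covariates at their sample means."""
    values = dict(means)
    values.update(group)  # override the focal indicator, e.g. {"hispanic": 1}
    z = intercept + sum(coefs[k] * values[k] for k in coefs)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logit coefficients and sample means, for illustration only.
coefs = {"hispanic": -0.9, "spanish": -2.0, "uninsured": -1.2}
intercept = 5.5
means = {"hispanic": 0.06, "spanish": 0.03, "uninsured": 0.30}

p_hispanic = predicted_probability_at_means(coefs, intercept, means, {"hispanic": 1})
p_not_hispanic = predicted_probability_at_means(coefs, intercept, means, {"hispanic": 0})
```

In this setup both groups are assigned the same average language and insurance profile, so the remaining gap reflects only the ethnicity coefficient. That is the sense in which adjusted rates can converge even when unadjusted rates differ widely.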
Once PIKs were assigned to EHRs, the next step was to match them to individuals in the ACS who had been assigned PIKs. Unadjusted conditional ACS match rates were 17.82 percent for White patients, 14.46 percent for Black patients, and 10.21 percent for Hispanic patients. Adjustments for sex, birth cohort, health insurance, and residence were of little consequence to conditional ACS match rates for White and Black patients but made a substantial difference for Hispanic patients. For Hispanic patients, the adjusted conditional ACS match rate was more than 40 percent higher than the unadjusted rate (14.56% versus 10.21%), again suggesting the importance of access and utilization. That there is not more of an improvement for Black patients is likely due to under-coverage in the ACS (Table 3). Although the ACS coverage rates refer to a combined race/ethnicity category rather than separate categories, they are nevertheless instructive. ACS coverage rates for 2017 were highest for Non-Hispanic White individuals (94.9%), lower for Hispanic individuals (86.9%), and lowest for Non-Hispanic Black individuals (82.5%). Additionally, among these three groups, the adjusted ACS match rates in Table 3 were highest for White patients and lowest for Black and Hispanic patients. It is important to remember that population estimates based on ACS data adjust for coverage.
Discussion
Prevalence is measured relative to a population at risk. Research that draws on EHRs to describe health disparities typically uses the total number of patients of a specific race or ethnicity for the denominator. However, because the patients included in EHRs may be differentially selected from the source population, these estimates are vulnerable to bias (Goldstein et al., 2016). Prevalence may be overstated if representation is better in the numerator than the denominator. If the extent of this bias differs by race or ethnicity, health disparities may be misestimated, which has implications for the use of EHRs for population research. The goal of this research note was to reflect on this potential bias by using information gleaned from linking EHRs from a large integrated health delivery system in North Carolina to ACS microdata.
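A small worked example may help show how differential selection overstates prevalence. The sketch assumes a hypothetical population of 1,000 people with 100 cases, and hypothetical probabilities of appearing in the EHR for people with and without the condition; none of these numbers come from the study.

```python
def prevalence(cases, at_risk):
    """Prevalence per 100: cases divided by the population at risk."""
    return 100 * cases / at_risk

def ehr_prevalence(true_cases, true_at_risk, p_seen_if_case, p_seen_if_not):
    """Prevalence observed in an EHR when inclusion depends on disease status."""
    seen_cases = true_cases * p_seen_if_case
    seen_non_cases = (true_at_risk - true_cases) * p_seen_if_not
    return prevalence(seen_cases, seen_cases + seen_non_cases)

# Hypothetical population: 100 cases among 1,000 people at risk.
true_rate = prevalence(100, 1000)

# People with the condition are more likely to seek care and enter the EHR.
selective = ehr_prevalence(100, 1000, p_seen_if_case=0.9, p_seen_if_not=0.5)

# Equal inclusion for cases and non-cases leaves prevalence unchanged.
non_selective = ehr_prevalence(100, 1000, p_seen_if_case=0.5, p_seen_if_not=0.5)
```

If the gap between the two inclusion probabilities differs across racial or ethnic groups, the overstatement differs too, and comparisons between groups are distorted even when each group is large.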
The first step in this linkage, PIK assignment, provided information about the quality of personal information in the EHRs and patient representation in government data systems. Overall, PIK assignments were successful. Availability of SSNs in many of the health records facilitated these assignments, helping to explain why PIK rates for the EHRs were higher than for the 2010 ACS. The exceptions were for patients who identified as “other” race or Hispanic ethnicity in the EHRs. Unadjusted PIK rates for these groups were 77.61 percent and 71.65 percent, respectively. The disproportionate representation of recent immigrants could help explain why PIK rates are so low for Hispanic patients. Problems with PIK assignments for Hispanic patients in our sample may also relate to documentation status.3 More than half of undocumented immigrants in North Carolina are Hispanic (Pew Research Center, 2019). Underutilization of health services among people with undocumented status is well known (Cabral & Cuevas, 2020), possibly for that reason but also due to lack of insurance (except in the case of pregnancy), but those urgently needing medical care may opt to take the risk. EHRs thus may include more undocumented immigrants than government sources, especially those in poor health.
The second step, matching to the ACS conditional on PIK assignment, provided information about representation in the ACS. Our procedures consider ACS data over a large interval, 2001-2017. A conditional ACS match rate that falls short of the average might be due to changes in racial and ethnic composition over that period. In North Carolina, the Hispanic population has increased, more than doubling between 2000 and 2010 and increasing by a further 40 percent the following decade (U.S. Census Bureau, 2020). Because the Hispanic population was smaller in earlier years of the ACS, we expected conditional ACS match rates to be lower as a result, and they were.4
Conditional ACS match rates also depend on ACS coverage, measures of which reflect inclusion of addresses in the frame as well as response rates for those found. Despite best efforts, and the fact that participation is mandatory, the ACS falls short of complete coverage. Challenges associated with enumerating the Black population are long standing and well documented (O’Hare, 2019). Lower coverage for the Hispanic population may reflect undocumented status in part. Not only are people with undocumented status less likely to participate in federal data collections, those with whom they live may also avoid participation, even those who are documented or citizens (Hall et al., 2019; Kopparam, 2022).
What does this mean for health disparities research based on EHRs? Our results suggest that differences between disease prevalence rates for Hispanic individuals relative to other groups could be misrepresented by biases in the EHRs. On one hand, our results suggest that undocumented immigrants may be better represented in EHRs than in government systems. Yet, this representation is likely selective, related to the need for health care. Compounding the problem is that a large proportion of Hispanic individuals are uninsured, almost three times that of the state population (31% compared to 11% in 2019, respectively) (North Carolina Justice Center, 2020). Using ACS data to adjust the denominator of the prevalence rate would be an improvement but would not fully correct the problem given that Hispanic individuals missing from government records are less likely to participate in the ACS as well.
The situation is different for Black patients. It is possible that lack of trust and other characteristics that reduce the participation of Black individuals in the ACS also lead these individuals to avoid interacting with the health care system. However, this seems unlikely. Based on high PIK rates, it appears that Black patients are part of government data collection more broadly. Further, the fraction of uninsured for Black individuals is about the same as for North Carolina as a whole (12% compared to 11%) (North Carolina Justice Center, 2020). Moreover, according to BRFSS data for 2019, Black North Carolinians saw a doctor in the past 12 months for routine medical care (checkups) at a rate roughly equivalent to White North Carolinians (84% compared to 80%) and substantially more than Hispanic North Carolinians (64%) (NC State Center for Health Statistics, 2019). In addition to differences in insurance coverage, differences in age structure and the healthy migrant effect for first generation Hispanic individuals help to explain this difference. Given all of this, using ACS data to correct the denominator of prevalence rates for potential underrepresentation seems like a reasonable approach for Black patients.
Of course, our analysis has limitations. First, the data come from a single health system, which, while large, is one of several in the state of North Carolina. Its mission is to serve the health needs of all North Carolinians, so we have compared patients to the population of the state, but residents do have alternatives and coverage is far from complete. Second, our analysis assumed that EHR-based information about race and ethnicity reflected patient identities. The information was missing for about 10 percent of our sample, a significant fraction even if low compared to many other health systems. A detailed discussion of these patients is beyond the scope of this research note, but we would like to point out that PIK and conditional ACS match rates for this group were quite high. Third, the results may reflect some of the peculiarities of North Carolina, perhaps especially the recent growth in the Hispanic population and the role immigration is playing in this growth. While our results may not completely generalize to other health systems or states, there are larger lessons learned, particularly about potential bias in the EHRs, the value of linking EHRs to external sources for assessment purposes, and finally the possibility of leveraging EHRs for research on social determinants and population health.
Acknowledgements:
We are grateful for support from the Carolina Population Center (NICHD P2C HD050924), the UNC Translational and Clinical Sciences (TraCS) Institute (CTSA UL1TR002489), and the Enhancing Health Data (EHealth) program at the U.S. Census Bureau (census.gov/ehealth). This paper is to inform interested parties of ongoing research and to encourage discussion. Any opinions and conclusions expressed herein are those of the authors and do not reflect the views of the U.S. Census Bureau. All results were approved for release by the Disclosure Review Board of the U.S. Census Bureau, authorization numbers CBDRB-FY21-POP001-0087, CBDRB-FY23-SEHSD003-021, CBDRB-FY23-POP001-0074, CBDRB-FY23-POP001-0053, and CBDRB-FY22-POP001-0141. All numeric values were rounded according to U.S. Census Bureau disclosure protocols to preserve data privacy.
Appendix A
Table A.1.
Likelihood of Patient Receiving a PIK or Being Matched to ACS (Weighted)
| | Panel A: Likelihood of patient receiving a PIK | | Panel B: Likelihood of patient being matched to ACS (conditional on receiving a PIK) | |
|---|---|---|---|---|
| | LPM | Logit (AME) | LPM | Logit (AME) |
| | (1) | (2) | (3) | (4) |
| Race (White omitted) | | | | |
| Black | .001739** | .0005078 | −.0287*** | −.02876*** |
| | (.000543) | (.001028) | (0.002887) | (.002888) |
| American Indian or Alaska Native | −.003379 | −.00775** | 0.01026 | .01103 |
| | (.002138) | (.002469) | (0.007168) | (.007257) |
| Asian | −.01966*** | −.03095*** | −.03355*** | −.03382*** |
| | (.001531) | (.002023) | (0.003646) | (.003724) |
| Other | −.0361*** | −.01902*** | −.02484*** | −.02707*** |
| | (.002024) | (.0009497) | (0.003147) | (.003415) |
| Missing/Refused | −.01107*** | −.0169*** | −.007349** | −.006844* |
| | (.001005) | (.001037) | (0.002822) | (.002837) |
| Ethnicity (Not Hispanic or Latino omitted) | | | | |
| Hispanic or Latino | −.04667*** | −.02362*** | −.01835*** | −.02045*** |
| | (.002696) | (.001326) | (0.003482) | (.00384) |
| Other/Missing | .001803 | −.002226** | 0.003325 | .003391 |
| | (.0009239) | (.000801) | (0.00282) | (.002838) |
| Sex (male omitted) | −.000552 | .0002036 | .01002*** | .01002*** |
| | (.0005158) | (.0005224) | (0.002776) | (.002787) |
| Birth Year (1944 and earlier omitted) | | | | |
| 1945-1949 | .0008198 | .004613 | −0.01222 | −.01182 |
| | (.001315) | (.003266) | (0.008432) | (.008172) |
| 1950-1954 | −.006132*** | .004192 | −0.01586 | −.01584 |
| | (.001383) | (.003385) | (0.008399) | (.008209) |
| 1955-1959 | .000422 | .01009** | −.02243* | −.02151* |
| | (.001559) | (.003381) | (0.008895) | (.008913) |
| 1960-1964 | .003996** | .01316*** | −.03238*** | −.0312*** |
| | (.001527) | (.00327) | (0.008932) | (.008968) |
| 1965-1969 | .003883* | .01131*** | −.04566*** | −.04446*** |
| | (.001598) | (.003243) | (0.009023) | (.009056) |
| 1970-1974 | −.001623 | .00726* | −.05629*** | −.05523*** |
| | (.001678) | (.003263) | (0.009062) | (.009084) |
| 1975-1979 | −.007128*** | .004718 | −.05859*** | −.05781*** |
| | (.001726) | (.003263) | (0.009237) | (.00928) |
| 1980-1984 | −.008131*** | .003916 | −.06681*** | −.06648*** |
| | (.001729) | (.003258) | (0.009208) | (.009228) |
| 1985+ | −.002657 | .006345 | −.05484*** | −.05427*** |
| | (.001725) | (.003256) | (0.009138) | (.009209) |
| Language (English omitted) | | | | |
| Spanish | −.3303*** | −.03884*** | −.04408*** | −.06067*** |
| | (.004517) | (.001803) | (0.003934) | (.004368) |
| Unknown | −.01176*** | −.005569*** | −0.003828 | −.00387 |
| | (.001803) | (.001121) | (0.005517) | (.005528) |
| Other | −.0523*** | −.02448*** | −.06348*** | −.07045*** |
| | (.005556) | (.002792) | (0.008197) | (.009028) |
| Health Insurance* | | | | |
| Any private insurance | .03612*** | .02605*** | .0171*** | .0182*** |
| | (.0009549) | (.0009136) | (0.003709) | (.003818) |
| Any had Medicare | .02602*** | .02604*** | .01136* | .0119* |
| | (.001189) | (.002181) | (0.005064) | (.004976) |
| Any Medicaid | .01676*** | .006137*** | −.01455*** | −.01656*** |
| | (.00104) | (.0006143) | (0.003844) | (.004362) |
| Any uninsured/self-pay | −.01398*** | −.007856*** | −.01009** | −.011*** |
| | (.0006188) | (.0008687) | (0.003141) | (.003285) |
| Insurance data missing | .02151*** | .004712 | −0.001292 | −.0008151 |
| | (.005751) | (.002915) | (0.01929) | (.02151) |
| Residence (In-state omitted) | | | | |
| Out-of-state | −.005473* | −.007301** | −0.0001316 | −.0001431 |
| | (.002421) | (.002758) | (0.008889) | (.008683) |
| Missing residence | −.004614*** | −.003478*** | .0109*** | .01043*** |
| | (.0005306) | (.0006258) | (0.003179) | (.0031) |
| Constant | .9709*** | | .198*** | |
| | (.001742) | | (0.009038) | |
| R-squared | 0.2921 | | 0.008917 | |
| N | 199,000 | 199,000 | 187,000 | 187,000 |
*p<.05, **p<.01, ***p<.001; two-tailed test
Source: Electronic Health Records (2016-2019) and American Community Survey data (2001-2017)
Note: Insurance categories are not mutually exclusive.
AIAN = American Indian and Alaska Native, AME = average marginal effect, LPM = linear probability model.
Standard errors are in parentheses. DMS #P-7519212. All results were approved for release by the Disclosure Review Board of the U.S. Census Bureau, authorization numbers CBDRB-FY22-POP001-0141 and CBDRB-FY23-SEHSD003-021. All numeric values were rounded according to U.S. Census Bureau disclosure protocols to preserve data privacy.
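To make the distinction between the LPM and Logit (AME) columns of Table A.1 concrete, the following is a minimal sketch on synthetic data, not the estimation code used for this research note. The variable names (`group`, `insured`), the coefficients used to simulate outcomes, and the sampling weights are all made up for illustration. The sketch fits a weighted linear probability model and a weighted logit, then computes the average marginal effect of a binary indicator as the weighted mean discrete change in predicted probability. Standard errors are omitted.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - dot(M[i][i + 1:n], x[i + 1:])) / M[i][i]
    return x

def xtwx(X, w):
    """Weighted cross-product matrix X' W X."""
    k = len(X[0])
    return [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(k)]
            for a in range(k)]

def xtwv(X, w, v):
    """Weighted cross-product vector X' W v."""
    k = len(X[0])
    return [sum(wi * xi[a] * vi for xi, wi, vi in zip(X, w, v)) for a in range(k)]

def fit_lpm(X, y, w):
    """Weighted least squares; coefficients are changes in probability."""
    return solve(xtwx(X, w), xtwv(X, w, y))

def fit_logit(X, y, w, iterations=25):
    """Weighted logit fitted by Newton-Raphson, starting from zero."""
    beta = [0.0] * len(X[0])
    for _ in range(iterations):
        p = [sigmoid(dot(beta, xi)) for xi in X]
        gradient = xtwv(X, w, [yi - pi for yi, pi in zip(y, p)])
        hessian = xtwx(X, [wi * pi * (1.0 - pi) for wi, pi in zip(w, p)])
        beta = [b + s for b, s in zip(beta, solve(hessian, gradient))]
    return beta

def average_marginal_effect(X, w, beta, col):
    """Weighted mean change in predicted probability when column `col` goes 0 -> 1."""
    total = 0.0
    for xi, wi in zip(X, w):
        hi, lo = xi[:], xi[:]
        hi[col], lo[col] = 1.0, 0.0
        total += wi * (sigmoid(dot(beta, hi)) - sigmoid(dot(beta, lo)))
    return total / sum(w)

# Synthetic data: every number below is an assumption for illustration only.
random.seed(0)
X, y, w = [], [], []
for _ in range(4000):
    group = 1.0 if random.random() < 0.3 else 0.0        # hypothetical group indicator
    insured = 1.0 if random.random() < 0.5 else 0.0      # hypothetical covariate
    p_true = sigmoid(2.0 - 0.8 * group + 0.5 * insured)  # made-up coefficients
    X.append([1.0, group, insured])
    y.append(1.0 if random.random() < p_true else 0.0)
    w.append(random.uniform(0.5, 2.0))                   # made-up sampling weight

lpm_coef = fit_lpm(X, y, w)
logit_ame = average_marginal_effect(X, w, fit_logit(X, y, w), col=1)
print(f"LPM coefficient on group: {lpm_coef[1]:.4f}")
print(f"Logit AME for group:      {logit_ame:.4f}")
```

Both quantities are on the probability scale, which is why the paired columns in Table A.1 can be read side by side. They tend to agree when predicted probabilities are not close to 0 or 1 and can diverge when they are.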
Footnotes
The total number of housing units included in the ACS averages around 2 million nationally (with the exception of 2020) and 60,000 for the state of North Carolina (U.S. Census Bureau, 2022c). The number of persons per household between 2018 and 2022 was 2.57 nationally and 2.48 for North Carolina (U.S. Census Bureau, 2023). We multiplied the total number of housing units by persons per household to estimate the number of individuals included in each year of the ACS.
This research is the result of a collaboration between researchers at the University of North Carolina at Chapel Hill (UNC) and the Enhancing Health Data (EHealth) Program at the U.S. Census Bureau. The EHealth Program (census.gov/ehealth) partners with health data organizations to produce high-quality statistics and research related to population health. Title 13 of the U.S. Code authorizes the Census Bureau to collect information from other entities and requires the Census Bureau to keep the information confidential and to use it only for statistical purposes. If you are interested in using existing Census Bureau restricted microdata, please see the FSRDC website: https://www.census.gov/about/adrm/fsrdc.html.
If immigration status were the primary explanation for these low PIK rates, we would expect similarly low PIK rates for Asian patients, who are on average even more recent arrivals in North Carolina. Instead, unadjusted PIK rates for Asian patients are very high, only two percentage points lower than for White patients (Table 3).
The Asian population has also increased dramatically, especially between 2010 and 2020 (64%) (U.S. Census Bureau, 2020). The conditional ACS match rate for Asian patients in the EHRs is the lowest of any group in our sample (14%) (Table 3).
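As a rough check on the ACS sample-size arithmetic described in the first footnote above, taking the approximate housing-unit counts at face value and applying the 2018-2022 persons-per-household figures gives:

$$
2{,}000{,}000 \times 2.57 \approx 5{,}140{,}000 \ \text{persons per year nationally}, \qquad 60{,}000 \times 2.48 \approx 148{,}800 \ \text{persons per year in North Carolina}.
$$

These are order-of-magnitude figures only, since the housing-unit counts are rounded averages and household size varies by year.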
Contributor Information
Aubrey Limburg, U.S. Census Bureau.
Jordan Young, University of North Carolina at Chapel Hill.
Tim Carey, University of North Carolina at Chapel Hill School of Medicine.
Paul Chelminski, University of North Carolina at Chapel Hill School of Medicine.
Victoria M. Udalova, U.S. Census Bureau.
Barbara Entwisle, University of North Carolina at Chapel Hill.
References
- Boehmer TK, Koumans EH, Skillen EL, Kappelman MD, Carton TW, Patel A, August EM, Bernstein R, Denson JL, Draper C, Gundlapalli AV, Paranjape A, Puro J, Rao P, Siegel DA, Trick WE, Walker CL, & Block JP (2022). Racial and Ethnic Disparities in Outpatient Treatment of COVID-19—United States, January-July 2022. Morbidity and Mortality Weekly Report, 71(43), 1359–1365. 10.15585/mmwr.mm7143a2
- Bond B, Brown JD, Luque A, & O’Hara A (2014). The nature of the bias when studying only linkable person records: Evidence from the American Community Survey (Center for Administrative Records Research and Applications Working Paper CARRA-WP-2014-08; p. 30). U.S. Census Bureau.
- Bower JK, Patel S, Rudy JE, & Felix AS (2017). Addressing Bias in Electronic Health Record-Based Surveillance of Cardiovascular Disease Risk: Finding the Signal Through the Noise. Current Epidemiology Reports, 4(4), 346–352. 10.1007/s40471-017-0130-z
- Cabral J, & Cuevas AG (2020). Health inequities among Latinos/Hispanics: Documentation status as a determinant of health. Journal of Racial and Ethnic Health Disparities, 7(5), 874. 10.1007/s40615-020-00710-0
- Casey JA, Schwartz BS, Stewart WF, & Adler NE (2016). Using electronic health records for population health research: A review of methods and applications. Annual Review of Public Health, 37(1), 61–81. 10.1146/annurev-publhealth-032315-021353
- Cook L, Espinoza J, Weiskopf NG, Mathews N, Dorr DA, Gonzales KL, Wilcox A, Madlock-Brown C, & Consortium N (2022). Issues with Variability in Electronic Health Record Data About Race and Ethnicity: Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave. JMIR Medical Informatics, 10(9), e39235. 10.2196/39235
- Foster TB, Fernandez L, Porter SR, & Pharris-Ciurej N (2024). Racial and Ethnic Disparities in Excess All-Cause Mortality in the First Year of the COVID-19 Pandemic. Demography, 61(1), 59–85. 10.1215/00703370-11133943
- Friedman DJ, Parrish RG, & Ross DA (2013). Electronic Health Records and US Public Health: Current Realities and Future Promise. American Journal of Public Health, 103(9), 1560–1567. 10.2105/AJPH.2013.301220
- Goldstein BA, Bhavsar NA, Phelan M, & Pencina MJ (2016). Controlling for Informed Presence Bias Due to the Number of Health Encounters in an Electronic Health Record. American Journal of Epidemiology, 184(11), 847–855. 10.1093/aje/kww112
- Groves RM (2006). Nonresponse Rates and Nonresponse Bias in Household Surveys. Public Opinion Quarterly, 70(5), 646–675. 10.1093/poq/nfl033
- Hall M, Musick K, & Yi Y (2019). Living Arrangements and Household Complexity among Undocumented Immigrants. Population and Development Review, 45(1), 81–101.
- Huennekens K, Oot A, Lantos E, Yee LM, & Feinglass J (2020). Using Electronic Health Record and Administrative Data to Analyze Maternal and Neonatal Delivery Complications. The Joint Commission Journal on Quality and Patient Safety, 46(11), 623–630. 10.1016/j.jcjq.2020.08.007
- Klompas M, Cocoros NM, Menchaca JT, Erani D, Hafer E, Herrick B, Josephson M, Lee M, Payne Weiss MD, Zambarano B, Eberhardt KR, Malenfant J, Nasuti L, & Land T (2017). State and Local Chronic Disease Surveillance Using Electronic Health Record Systems. American Journal of Public Health, 107(9), 1406–1412. 10.2105/AJPH.2017.303874
- Kopparam R (2022, October 14). What federal statistical agencies can do to improve survey response rates among Hispanic communities in the United States. Equitable Growth. https://equitablegrowth.org/what-federal-statistical-agencies-can-do-to-improve-survey-response-rates-among-hispanic-communities-in-the-united-states/
- Ma A, Sanchez A, & Ma M (2022). Racial disparities in health care utilization, the affordable care act and racial concordance preference. International Journal of Health Economics and Management, 22(1), 91–110. 10.1007/s10754-021-09311-8
- Mahajan S, Caraballo C, Lu Y, Valero-Elizondo J, Massey D, Annapureddy AR, Roy B, Riley C, Murugiah K, Onuma O, Nunez-Smith M, Forman HP, Nasir K, Herrin J, & Krumholz HM (2021). Trends in Differences in Health Status and Health Care Access and Affordability by Race and Ethnicity in the United States, 1999-2018. JAMA, 326(7), 637–648. 10.1001/jama.2021.9907
- Meyer BD, Mok WKC, & Sullivan JX (2015). Household Surveys in Crisis. Journal of Economic Perspectives, 29(4), 199–226. 10.1257/jep.29.4.199
- Montez JK, Zajacova A, Hayward MD, Woolf SH, Chapman D, & Beckfield J (2019). Educational Disparities in Adult Mortality Across U.S. States: How Do They Differ, and Have They Changed Since the Mid-1980s? Demography, 56(2), 621–644. 10.1007/s13524-018-0750-z
- NC State Center for Health Statistics. (2019). 2019 BRFSS Survey Results: North Carolina Health Care Access. NCDHHS Division of Public Health. https://schs.dph.ncdhhs.gov/data/brfss/2019/nc/all/checkup1.html
- North Carolina Justice Center. (2020, June 6). North Carolina’s overall uninsured rate masks stark differences across racial and ethnic groups. North Carolina Justice Center. https://www.ncjustice.org/publications/north-carolinas-overall-uninsured-rate-masks-stark-differences-across-racial-and-ethnic-groups/
- O’Hare WP (2019). Differential Undercounts in the U.S. Census: Who is Missed? Springer International Publishing. 10.1007/978-3-030-10973-8
- Pew Research Center. (2019, February 5). U.S. unauthorized immigrant population estimates by state, 2016. Pew Research Center’s Hispanic Trends Project. https://www.pewresearch.org/hispanic/interactives/u-s-unauthorized-immigrants-by-state/
- Rumball-Smith J, & Bates DW (2018). The Electronic Health Record and Health IT to Decrease Racial/Ethnic Disparities in Care. Journal of Health Care for the Poor and Underserved, 29(1), 58–62. 10.1353/hpu.2018.0006
- Sharifi M, Sequist TD, Rifas-Shiman SL, Melly SJ, Duncan DT, Horan CM, Smith RL, Marshall R, & Taveras EM (2016). The Role of Neighborhood Characteristics and the Built Environment in Understanding Racial/Ethnic Disparities in Childhood Obesity. Preventive Medicine, 91, 103–109. 10.1016/j.ypmed.2016.07.009
- Udalova V, Carey TS, Chelminski PR, Dalzell L, Knoepp P, Motro J, & Entwisle B (2022). Linking Electronic Health Records to the American Community Survey: Feasibility and Process. American Journal of Public Health, 112(6), 923–930. 10.2105/AJPH.2022.306783
- U.S. Census Bureau. (2019a). American Community Survey 1-Year Estimates: B03003—Hispanic or Latino Origin in North Carolina. https://data.census.gov/table/ACSDT1Y2019.B03003?q=Hispanic%20or%20Latino&g=040XX00US37
- U.S. Census Bureau. (2019b). American Community Survey 1-Year Estimates: S0201—Selected Population Profile in the United States in North Carolina. https://data.census.gov/table/ACSSPP1Y2019.S0201?q=S0201&t=−09&g=040XX00US37
- U.S. Census Bureau. (2020). Decennial Census 2000, 2010, and 2020: Table DP1- Profile of General Population and Housing Characteristics in North Carolina. https://data.census.gov/table/DECENNIALDPSLDH2000.DP1?q=DP1%20North%20Carolina&tid=DECENNIALDPCD110H2000.DP1
- U.S. Census Bureau. (2022a). American Community Survey (ACS) Coverage Rates. https://www.census.gov/acs/www/methodology/sample-size-and-data-quality/coverage-rates/
- U.S. Census Bureau. (2022b). American Community Survey (ACS) response rates. https://www.census.gov/acs/www/methodology/sample-size-and-data-quality/response-rates/
- U.S. Census Bureau. (2022c). American Community Survey: Sample Size. https://www.census.gov/acs/www/methodology/sample-size-and-data-quality/sample-size
- U.S. Census Bureau. (2022d). Coverage Rates Definitions. https://www.census.gov/programs-surveys/acs/methodology/sample-size-and-data-quality/coverage-rates-definitions.html
- U.S. Census Bureau. (2023). U.S. Census Bureau QuickFacts: North Carolina; United States. https://www.census.gov/quickfacts/fact/table/NC,US/HSD310222
- Wagner D, & Layne M (2014). The Person Identification Validation System (PVS): Applying the Center for Administrative Records Research and Applications’ (CARRA) record linkage software (CARRA Working Paper Series 2014–01; p. 23). U.S. Census Bureau.
- Weiskopf NG, & Weng C (2013). Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research. Journal of the American Medical Informatics Association, 20(1), 144–151. 10.1136/amiajnl-2011-000681
- Wiltz JL, Feehan AK, Molinari NM, Ladva CN, Truman BI, Hall J, Block JP, Rasmussen SA, Denson JL, Trick WE, Weiner MG, Koumans E, Gundlapalli A, Carton TW, & Boehmer TK (2022). Racial and Ethnic Disparities in Receipt of Medications for Treatment of COVID-19—United States, March 2020-August 2021. Morbidity and Mortality Weekly Report, 71(3), 96–102. 10.15585/mmwr.mm7103e1
