ABSTRACT
Introduction
Many clinical data networks focus on a single use‐case or disease. By contrast, the TriNetX Dataworks‐USA Network contains real‐world clinical information that can be applied to multiple research questions and use cases. The purpose of this study is to describe the Network's characteristics, as well as its generalizability to the US population, particularly the healthcare‐seeking population.
Methods
Using the Dataworks‐USA Network, a large, regularly updated data network containing de‐identified patient electronic health record (EHR) information from across the United States, basic demographics were summarized and compared to the US Census Bureau International Database (IDB) 2022 data and the National Cancer Institute's version of the Census Bureau's U.S. County Population Data for 2022 to examine the generalizability of the Network.
Results
Patients in the Dataworks‐USA Network are on average approximately 5 years older than the Census population, and the Network has a larger proportion of female patients. The Network has a lower proportion of patients identified as Asian and White race, and a higher proportion who identify as Other, relative to the Census; other races are similar between the two data sources (< 1% difference). Regionally, Dataworks‐USA has a smaller proportion of patients in every known race category compared with the Census because of its larger proportion of patients of Unknown or Other race.
Conclusions
TriNetX's Dataworks‐USA Network provides a robust data source for many use cases and is broadly generalizable to the US population, particularly the healthcare‐seeking population, with differences related to the underlying nature of the data sources.
Keywords: census, electronic health record, federated data network, generalizability, real‐world data
Summary.
TriNetX Dataworks‐USA Network is a large research network comprised of de‐identified EHR data collected from multiple health systems as a part of routine patient care. The data appear to align well with the overall demographics of the general US population.
These data can support multiple research questions and use cases, such as disease burden, treatment patterns, health outcomes, and comparative medical product safety and effectiveness.
Observed differences between the TriNetX Dataworks‐USA Network and the general US population are largely consistent with prior literature on demographic differences expected in healthcare‐seeking populations relative to the general population.
1. Introduction
Since the 21st Century Cures Act, use of real‐world data (RWD) to accelerate medical product development [1] and support regulatory decision‐making has become increasingly common [2, 3, 4]. While randomized clinical trials (RCTs) have long been considered the gold standard for regulatory decisions, real‐world evidence (RWE) studies have become increasingly accepted as supportive evidence when RCTs are not feasible or ethical [5]. Federated data networks, which enable analysis of patient data from multiple organizations while minimizing privacy risk, allow researchers to access larger, richer source populations that can help increase statistical power, produce more precise results, evaluate rare diseases, and deepen insights [6].
In 2014, TriNetX developed a real‐time RWD querying platform that now leverages a federated data network of 65 healthcare organizations (HCOs) across the United States and 182 HCOs globally (both counts as of July 22, 2024). The TriNetX platform hosts a large collection of de‐identified data, key analytic capabilities, and opportunities for clinical trial site identification, engagement, and participation [7, 8]. TriNetX's platform is Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), and General Personal Data Protection Law (LGPD) compliant [7]. The platform gives clients with user access iterative, real‐time query access to up‐to‐date federated electronic health record (EHR) data across these HCOs [7]. It is also possible to download instances of the de‐identified data to analyze offline, or to purchase custom datasets.
Large, multi‐site data networks previously described in the literature are often designed and governed for a single purpose, such as surveillance or the observation of drug effects, or for a single disease or disease area, such as cancer research or COVID‐19 [9, 10, 11, 12]. The TriNetX Dataworks‐USA Network was designed to address a broad set of research questions from clinical trial design optimization to burden of illness, treatment patterns, and outcome studies. The purpose of this study is to describe TriNetX's Dataworks‐USA Network characteristics, as well as its generalizability to the US population. Other TriNetX networks include global data, region‐specific data (e.g., Latin America, European Union), and the Linked network, which links tokenized U.S. EHR data to closed claims and death data.
2. Methods
2.1. Data Source
The Dataworks‐USA Network is comprised of approximately 75% academic medical centers and 25% community hospitals, integrated delivery networks, specialty hospitals, and large specialty physician practices. The Network has detailed clinical information available for over 110 million patients with frequent updates and little to no data lag, as HCOs update their data every 2 weeks on average, though the exact frequency varies by HCO. The Network contains data on medical encounters in the inpatient and outpatient settings that include demographic information, diagnoses recorded, medications administered, prescriptions written, laboratory test results, vital signs, and procedures for each medical encounter and day of a hospital stay. The TriNetX Dataworks‐USA Network includes structured data from all HCOs, as well as unstructured clinical documents available from a subset of HCOs in the network. Geographically, HCOs are well‐distributed across the US. EHR data from the TriNetX Dataworks‐USA Network are generated from routine healthcare encounters within each participating HCO. Patients may receive all or a proportion of their care at the HCO in the TriNetX Dataworks‐USA Network. Healthcare encounters that occur outside the contributing HCO will not be observed.
Extensive data quality procedures are implemented on an ongoing basis. TriNetX data sourcing principles include: liberating all health data, preserving the original data and documenting provenance, harmonizing for interoperability, and actively monitoring quality. The data quality program follows the guidelines developed by Kahn et al. [13] and includes reviews of conformance, completeness, and plausibility; these categories are further separated into two evaluation contexts: validation and verification. All patient data in the TriNetX Network are harmonized to standard terminologies. Clinical facts from the EHR are represented by International Classification of Diseases, Ninth/Tenth Revision, Clinical Modification (ICD‐9/10‐CM) diagnosis codes; Current Procedural Terminology (CPT), Healthcare Common Procedure Coding System (HCPCS), and International Classification of Diseases, Tenth Revision, Procedure Coding System (ICD‐10‐PCS) procedure codes; RxNorm medication codes organized into Veterans Affairs Therapeutic Class System (VA Class) and Anatomic Therapeutic Chemical (ATC) hierarchies; Health Level Seven International Fast Healthcare Interoperability Resources Release 4 (HL7 FHIR Release 4) encounter type codes; and Logical Observation Identifiers Names and Codes (LOINC) for lab tests, among others.
2.2. Statistical Analysis
Data from U.S. sites in the TriNetX Dataworks‐USA Network were accessed on July 25, 2024, using SQL queries. Aggregate statistics from each site were retrieved and then further aggregated. Patients' sex, date of birth, race, and ethnicity were obtained directly from this query. If a death record existed in the EHR data, age was calculated as year of death minus year of birth; otherwise, age was calculated as the current year at the time of the analysis (2024) minus year of birth. Sex, race, and ethnicity all include an Unknown category. Unknown could be due to this information being documented as ‘unknown’ by the HCO, missing when the data was made available to TriNetX by the HCO, or lacking a mapping to the appropriate standard for other reasons. The Other race classification is applied to those patients for whom more than one race is specified in the EHR data or for whom the race documented in the EHR is not one of the available options in the TriNetX Dataworks‐USA Network, which are consistent with the HL7 CDC Version 1 standards. This same analysis was then repeated only among patients who had at least one clinical encounter in the past five complete calendar years (2019–2023).
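The age rule described above can be sketched in a few lines of Python. This is an illustration only: the function name `age_in_years` and its parameters are hypothetical, and the SQL queries actually used for the analysis are not reproduced here.

```python
from typing import Optional

ANALYSIS_YEAR = 2024  # year of the analysis, as stated in the text


def age_in_years(birth_year: int,
                 death_year: Optional[int] = None,
                 analysis_year: int = ANALYSIS_YEAR) -> int:
    """Return age using year-level arithmetic only.

    If a death record exists, age is year of death minus year of birth;
    otherwise it is the analysis year minus year of birth.
    """
    end_year = death_year if death_year is not None else analysis_year
    return end_year - birth_year


# Hypothetical patients: one living (born 1980), one deceased (born 1950, died 2020).
print(age_in_years(1980))        # 44
print(age_in_years(1950, 2020))  # 70
```

Because only calendar years enter the calculation, an age derived this way can differ by up to one year from an age computed from full dates.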
US Census Bureau International Database (IDB) 2022 data was downloaded on February 8, 2024 [14]. Counts of US residents by sex, age, race, and ethnicity were obtained from the downloaded data. Additionally, the National Cancer Institute's version of the Census Bureau's U.S. County Population Data for 2022 was downloaded on June 11, 2024 [15]. From the latter dataset, counts of U.S. residents by county by race were used to calculate regional race estimates.
Visualizations of the HCO and Census statistics were generated using Python data analysis packages (Pandas, Matplotlib, Seaborn). Descriptive statistics were post‐processed using Python data analysis packages (Pandas, NumPy).
3. Results
The TriNetX Dataworks‐USA Network and general U.S. population characteristics are presented in Table 1. The TriNetX Dataworks‐USA Network contains EHR records of over 110 million patients, with most data from 2007 to present, though some patients may be counted more than once if they receive care at multiple institutions. The U.S. population numbered over 330 million people as of 2022. Over 72 million patients have had at least one encounter in the past five complete calendar years (2019–2023); these patients with a recent encounter documented have a relatively similar sex, age, race, ethnicity, and geographic distribution, though there were fewer patients with an unknown race and unknown ethnicity in this subset (Table 1).
TABLE 1.
Baseline characteristics among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census.
| Baseline characteristics | TNX Dataworks‐USA Network, all patients: N or mean | TNX Dataworks‐USA Network, all patients: % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | TNX Dataworks‐USA Network, encounter in the past 5 years: % or SD a | 2022 US Census: N or mean | 2022 US Census: % or SD a |
|---|---|---|---|---|---|---|
| Total N c | 110 731 104 | N/A | 72 938 765 | N/A | 330 708 020 | N/A |
| Demographic characteristics | ||||||
| Age | ||||||
| Mean, SD | 43.9 | 23.5 | 43.2 | 23.7 | 39.1 | 22.9 |
| Categories, n, % | ||||||
| 0–4 | 3 367 166 | 3.0 | 2 965 865 | 4.1 | 18 538 353 | 5.6 |
| 5–17 | 14 730 022 | 13.3 | 9 949 929 | 13.6 | 53 912 474 | 16.3 |
| 18–24 | 9 263 483 | 8.4 | 6 094 486 | 8.4 | 31 328 131 | 9.5 |
| 25–34 | 15 132 179 | 13.7 | 9 604 935 | 13.2 | 45 501 300 | 13.8 |
| 35–44 | 14 909 200 | 13.5 | 9 530 861 | 13.1 | 43 695 365 | 13.2 |
| 45–54 | 12 874 291 | 11.6 | 8 385 387 | 11.5 | 40 431 645 | 12.2 |
| 55–64 | 13 972 084 | 12.6 | 9 406 323 | 12.9 | 42 085 437 | 12.7 |
| 65+ | 26 482 679 | 23.9 | 17 000 979 | 23.3 | 55 215 315 | 16.7 |
| Sex, n, % | ||||||
| Female | 58 576 267 | 52.9 | 39 362 042 | 54.0 | 166 215 285 | 50.3 |
| Male | 51 728 456 | 46.7 | 33 272 283 | 45.6 | 164 492 735 | 49.7 |
| Unknown | 426 381 | 0.4 | 304 440 | 0.4 | 0 | 0.0 |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 345 805 | 0.3 | 252 537 | 0.3 | 4 382 234 | 1.3 |
| Asian | 4 102 745 | 3.5 | 2 976 645 | 4.0 | 20 953 941 | 6.3 |
| Native Hawaiian or Other Pacific Islander | 446 487 | 0.4 | 289 811 | 0.4 | 878 808 | 0.3 |
| Asian or Native Hawaiian or Other Pacific Islander (4‐race category) b | 4 549 232 | 3.9 | 3 266 456 | 4.4 | 21 832 749 | 6.6 |
| Black or African American | 15 037 371 | 12.9 | 10 701 021 | 14.4 | 45 399 743 | 13.6 |
| White | 64 217 093 | 55.3 | 43 501 760 | 58.3 | 251 602 174 | 75.5 |
| Other | 6 729 316 | 5.8 | 4 423 058 | 5.9 | 10 070 657 | 3.0 |
| Unknown | 25 311 283 | 21.8 | 12 408 459 | 16.6 | 0 | 0.0 |
| Ethnicity, n, % | ||||||
| Hispanic | 11 672 333 | 10.0 | 7 909 567 | 10.6 | 63 655 229 | 19.1 |
| Non‐Hispanic | 61 349 480 | 52.8 | 45 320 050 | 60.8 | 269 632 328 | 80.9 |
| Unknown | 43 168 287 | 37.2 | 21 323 674 | 28.6 | 0 | 0.0 |
| Geographic region, n, % | ||||||
| Midwest | 17 611 847 | 15.9 | 10 677 503 | 15.0 | 68 787 595 | 20.6 |
| Northeast | 32 703 609 | 29.5 | 19 861 902 | 27.9 | 57 040 406 | 17.1 |
| South | 44 312 379 | 40.0 | 30 156 043 | 42.4 | 128 716 192 | 38.6 |
| West | 16 113 086 | 14.6 | 10 413 940 | 14.6 | 78 743 364 | 23.6 |
a. These values may not add up to 100% for each category due to rounding.
b. This is an additional category listed in the data from the Census Bureau. For TriNetX data, the corresponding individual categories were summed. It should not be included when summing categories to 100%, as including it will cause the total to exceed 100%.
c. Patients are unique within a single HCO; however, due to the de‐identified nature of the data, patients who receive care at more than one HCO in the Network will not have their records linked and thus may be counted more than once.
The mean age among TriNetX Network patients is several years greater than that of the general US population (based on the US Census), 43.9 years (±23.5) versus 39.1 years (±22.9), with a higher proportion of patients 65 or more years old and a smaller proportion of 0–17 year olds (Table 1). Additionally, the TriNetX population has a slightly larger proportion of female patients as compared with the general US population (53% vs. 50%).
Among the entire TriNetX Dataworks‐USA Network population, the proportions of patients who are Black or African American, American Indian or Alaskan Native, and Native Hawaiian or Other Pacific Islander are similar to the proportions of the general US population who identify as each of those races (< 1% difference). The TriNetX Dataworks‐USA Network population has a lower proportion who identified as Asian and White race, and a higher proportion who identify as Other, than the general US population. For ethnicity, the TriNetX Dataworks‐USA Network population has a lower proportion of both Hispanic and Non‐Hispanic patients than the general US population because ethnicity is Unknown for 37.2% of the population.
However, when the distribution of race and ethnicity is restricted to patients for whom these fields are not Unknown, it more closely aligns with that of the Census overall. The following is the distribution of race among TriNetX patients whose race is documented (proportions recalculated to exclude unknown race and ethnicity, respectively): American Indian or Alaskan Native, 0.4%; Asian, 4.5%; Native Hawaiian or Other Pacific Islander, 0.5%; Black or African American, 16.5%; White, 70.7%; Other, 7.4%. For ethnicity: Hispanic, 16.0%; Not Hispanic or Latino, 84.0%.
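The recalculation can be reproduced from the counts in Table 1. The sketch below is illustrative only: the helper name `shares_excluding_unknown` is hypothetical, and the counts are copied from the "All patients" column of Table 1.

```python
def shares_excluding_unknown(counts: dict) -> dict:
    """Percentages over known categories only, rounded to one decimal place."""
    known = {k: v for k, v in counts.items() if k != "Unknown"}
    total_known = sum(known.values())
    return {k: round(100 * v / total_known, 1) for k, v in known.items()}


# Counts from Table 1, all TriNetX Dataworks-USA patients.
race_counts = {
    "American Indian or Alaskan Native": 345_805,
    "Asian": 4_102_745,
    "Native Hawaiian or Other Pacific Islander": 446_487,
    "Black or African American": 15_037_371,
    "White": 64_217_093,
    "Other": 6_729_316,
    "Unknown": 25_311_283,
}
ethnicity_counts = {
    "Hispanic": 11_672_333,
    "Non-Hispanic": 61_349_480,
    "Unknown": 43_168_287,
}

race_pct = shares_excluding_unknown(race_counts)
ethnicity_pct = shares_excluding_unknown(ethnicity_counts)
print(race_pct["White"], race_pct["Black or African American"])  # 70.7 16.5
print(ethnicity_pct["Hispanic"])  # 16.0
```

Dropping the Unknown category shrinks the denominator, so every remaining share rises; the relative ordering of the known categories is unchanged.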
The TriNetX Dataworks‐USA Network patients are well‐distributed across the United States, though there is a higher proportion of patients from the Northeast and a lower proportion of patients from the West and Midwest as compared with the general US population (Table 1).
Across all regions, the TriNetX Dataworks‐USA Network has a smaller proportion of patients in every known race category as compared with the US Census, meaning that there is a larger proportion of patients in the TriNetX data who have an Unknown race or are identified as Other (Table 2). When race by region is examined only among those whose race is known, the race distribution for all regions more closely resembles that of the US Census (Table 3).
TABLE 2.
Regional race distribution among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census.
| Race category | TNX Dataworks‐USA Network, all patients: N or mean | TNX Dataworks‐USA Network, all patients: % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | TNX Dataworks‐USA Network, encounter in the past 5 years: % or SD a | 2022 US Census: N or mean | 2022 US Census: % or SD a |
|---|---|---|---|---|---|---|
| Regional race distribution comparison | ||||||
| Northeast region total | 32 703 609 | | 19 861 902 | | 57 040 406 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 70 026 | 0.2 | 46 843 | 0.2 | 493 206 | 0.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 243 702 | 3.8 | 825 013 | 4.2 | 4 489 260 | 7.9 |
| Black or African American | 2 990 362 | 9.1 | 2 001 982 | 10.1 | 8 297 391 | 14.5 |
| White | 17 878 611 | 54.7 | 11 482 499 | 57.8 | 43 760 549 | 76.7 |
| Other | 1 929 109 | 5.9 | 1 269 962 | 6.4 | 0 | 0.0 |
| Unknown | 8 591 799 | 26.3 | 4 235 603 | 21.3 | 0 | 0.0 |
| Midwest region total | 17 611 847 | | 10 677 503 | | 68 787 595 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 104 498 | 0.6 | 79 554 | 0.7 | 719 506 | 1.0 |
| Asian or Native Hawaiian or Other Pacific Islander | 453 408 | 2.6 | 330 150 | 3.1 | 2 811 439 | 4.1 |
| Black or African American | 1 869 899 | 10.6 | 1 269 277 | 11.9 | 8 197 362 | 11.9 |
| White | 10 701 345 | 60.8 | 7 215 938 | 67.6 | 57 059 288 | 82.9 |
| Other | 1 091 456 | 6.2 | 633 425 | 5.9 | 0 | 0.0 |
| Unknown | 3 391 241 | 19.3 | 1 149 159 | 10.8 | 0 | 0.0 |
| South region total | 44 312 379 | | 30 156 043 | | 128 716 192 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 105 811 | 0.2 | 80 517 | 0.3 | 1 532 854 | 1.2 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 259 534 | 2.8 | 949 122 | 3.1 | 5 755 240 | 4.5 |
| Black or African American | 8 811 504 | 19.9 | 6 501 486 | 21.6 | 26 831 217 | 20.8 |
| White | 24 505 588 | 55.3 | 17 343 849 | 57.5 | 94 596 881 | 73.5 |
| Other | 1 806 086 | 4.1 | 1 259 726 | 4.2 | 0 | 0.0 |
| Unknown | 7 823 856 | 17.7 | 4 021 343 | 13.3 | 0 | 0.0 |
| West region total | 16 113 086 | | 10 413 940 | | 78 743 364 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 65 199 | 0.4 | 45 439 | 0.4 | 2 256 781 | 2.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 291 153 | 8.0 | 939 960 | 9.0 | 10 529 884 | 13.4 |
| Black or African American | 846 952 | 5.3 | 592 506 | 5.7 | 4 897 515 | 6.2 |
| White | 7 664 648 | 47.6 | 5 335 411 | 51.2 | 61 059 184 | 77.5 |
| Other | 1 902 665 | 11.8 | 1 259 945 | 12.1 | 0 | 0.0 |
| Unknown | 4 342 469 | 26.9 | 2 240 679 | 21.5 | 0 | 0.0 |
a. These values may not add up to 100% for each category due to rounding.
TABLE 3.
Baseline race and ethnicity distribution among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census, excluding the ‘unknown’ race and ethnicity categories.
| Baseline characteristics | TNX Dataworks‐USA Network, all patients: N or mean | TNX Dataworks‐USA Network, all patients: % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | TNX Dataworks‐USA Network, encounter in the past 5 years: % or SD a | 2022 US Census: N or mean | 2022 US Census: % or SD a |
|---|---|---|---|---|---|---|
| Demographic characteristics | ||||||
| Race, n, % (no unknown category) | ||||||
| Total | 90 878 817 | | 62 144 832 | | 333 287 557 | |
| American Indian or Alaskan Native | 345 805 | 0.4 | 252 537 | 0.4 | 4 382 234 | 1.3 |
| Asian | 4 102 745 | 4.5 | 2 976 645 | 4.8 | 20 953 941 | 6.3 |
| Native Hawaiian or Other Pacific Islander | 446 487 | 0.5 | 289 811 | 0.5 | 878 808 | 0.3 |
| Black or African American | 15 037 371 | 16.5 | 10 701 021 | 17.2 | 45 399 743 | 13.6 |
| White | 64 217 093 | 70.7 | 43 501 760 | 70.0 | 251 602 174 | 75.5 |
| Other | 6 729 316 | 7.4 | 4 423 058 | 7.1 | 10 070 657 | 3.0 |
| Ethnicity, n, % (no unknown category) | ||||||
| Total | 73 021 813 | | 53 229 617 | | 333 287 557 | |
| Hispanic | 11 672 333 | 16.0 | 7 909 567 | 14.9 | 63 655 229 | 19.1 |
| Non‐Hispanic | 61 349 480 | 84.0 | 45 320 050 | 85.1 | 269 632 328 | 80.9 |
a. These values may not add up to 100% for each category due to rounding.
Additionally, patients in the TriNetX Dataworks‐USA Network have, on average, over 75 diagnoses, approximately 150 medication records, and over 270 lab values and procedures each. On average, patients in the TriNetX Dataworks‐USA Network have 6.2 years between their first available encounter and their most recent available encounter; however, the average length of follow‐up in EHR‐based studies varies in the literature and depends on each study's case definition.
4. Discussion
The TriNetX Dataworks‐USA Network combines clinically rich data from over 60 HCOs across the United States, and the basic demographics of the TriNetX population (i.e., sex, age, race, and ethnicity) closely align with those in the 2022 US Census data, with some caveats that are to be expected given that the Dataworks‐USA Network represents a healthcare‐seeking population. Generally, Americans who do not access primary care have been found to be younger, less medically complex, of minority background, and/or live in the South [16]. In the United States, older individuals generally tend to be more likely to access primary care than those who are younger [16, 17]. In line with general expectations for a healthcare‐seeking population, the TriNetX Dataworks‐USA Network population is on average approximately 5 years older than the US population (Table 1, Figure 1).
FIGURE 1.

Sex and age distribution of the TriNetX research network.
Furthermore, when comparing age by sex between the TriNetX Dataworks‐USA Network and the general US population as well as existing trends in the literature [18], there appears to be a larger proportion of women at older ages, which aligns with the general population trend of women having a greater life expectancy than men (Figures 1 and 2) [18]. Additionally, there is a dip among patients of both sexes between their early 50s and 60 years of age in the TriNetX Dataworks‐USA Network (Figure 1), as well as in the general population (Figure 2), which represents the end of the Baby Boomer generation, who are currently 60–80 years of age.
FIGURE 2.

Sex and age distribution of the 2022 US census.
The TriNetX Dataworks‐USA Network correspondingly has a larger proportion of women overall (Table 1), particularly among patients over 18 years of age (Figure 1), relative to the general US population (Figure 2). In the U.S., adult women tend to be more likely to access primary care [19, 20]. This difference is pronounced during the child‐bearing years, which is reflected in the TriNetX age by sex distribution (Figure 1) [19].
Because race and ethnicity are self‐reported and not mandatory fields in EHR documentation, there is a substantial proportion of the TriNetX population for whom these fields are not available, 21.8% and 37.2%, respectively, while the US Census data does not have an “unknown” category (Table 1, Figures 3 and 4). Race or ethnicity may be Unknown in TriNetX's Dataworks‐USA Network for several reasons: the information may have been documented as ‘unknown’ by the HCO, the response may have been missing when the data was made available to TriNetX by the HCO, or the response may have lacked a mapping to the appropriate standard for other reasons. Race and ethnicity historically have been used to perpetuate fallacious stereotypes about biological differences between groups, which has led to mistrust and may lead to under‐reporting or discordant reporting between visits among individuals in minoritized communities [21]. Additionally, many Hispanic respondents may not identify with any of the OMB‐defined race categories [22]. These factors can all contribute to missing or discordant race and ethnicity data in the EHR. It is notable, however, that race data in the TriNetX Dataworks‐USA Network has a lower percentage of patients with Unknown race in the past 5 years as compared with all historic results (17% vs. 22%, Table 1).
FIGURE 3.

Race distribution of the TriNetX research network and the 2022 US census.
FIGURE 4.

Ethnicity distribution of the TriNetX research network and the 2022 US census.
With regard to regional findings, the West and the Midwest geographic regions are relatively underrepresented in the TriNetX Dataworks‐USA Network, while the Northeast is overrepresented as compared to the general US population (Table 1). In spite of this, race distribution is similar within each geographic region between the TriNetX Dataworks‐USA Network and the general US population when looking at only patients in the TriNetX Dataworks‐USA Network whose race is not Unknown (Table 4). Differences in the distribution of race overall seem to be driven by the fact that the EHR allows for race to be listed as Other or Unknown, or to remain undocumented, categories that are not available in the US Census's surveys. The exception to this is the percent of the TriNetX Dataworks‐USA Network population in the South who identify as White, which is lower than the general US population; however, when looking only among patients with an encounter in the past 5 years, this difference is greatly reduced (Table 2). It appears that prior to this time period, there may have been a meaningful proportion of the TriNetX Dataworks‐USA Network patient population who were identified as Other, which has decreased in recent years (Table 2).
TABLE 4.
Regional baseline race and ethnicity distribution among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census, excluding the ‘unknown’ race and ethnicity categories.
| Baseline characteristics | TNX Dataworks‐USA Network, all patients: N or mean | TNX Dataworks‐USA Network, all patients: % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | TNX Dataworks‐USA Network, encounter in the past 5 years: % or SD a | 2022 US Census: N or mean | 2022 US Census: % or SD a |
|---|---|---|---|---|---|---|
| Regional race distribution comparison | ||||||
| Northeast region total | 24 111 810 | | 15 626 299 | | 57 040 406 | |
| Race, n, % (no unknown category) | ||||||
| American Indian or Alaskan Native | 70 026 | 0.3 | 46 843 | 0.3 | 493 206 | 0.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 243 702 | 5.2 | 825 013 | 5.3 | 4 489 260 | 7.9 |
| Black or African American | 2 990 362 | 12.4 | 2 001 982 | 12.8 | 8 297 391 | 14.5 |
| White | 17 878 611 | 74.1 | 11 482 499 | 73.5 | 43 760 549 | 76.7 |
| Other | 1 929 109 | 8.0 | 1 269 962 | 8.1 | 0 | 0.0 |
| Midwest region total | 14 220 606 | | 9 528 344 | | 68 787 595 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 104 498 | 0.7 | 79 554 | 0.8 | 719 506 | 1.0 |
| Asian or Native Hawaiian or Other Pacific Islander | 453 408 | 3.2 | 330 150 | 3.5 | 2 811 439 | 4.1 |
| Black or African American | 1 869 899 | 13.1 | 1 269 277 | 13.3 | 8 197 362 | 11.9 |
| White | 10 701 345 | 75.3 | 7 215 938 | 75.7 | 57 059 288 | 82.9 |
| Other | 1 091 456 | 7.7 | 633 425 | 6.6 | 0 | 0.0 |
| South region total | 36 488 523 | | 26 134 700 | | 128 716 192 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 105 811 | 0.3 | 80 517 | 0.3 | 1 532 854 | 1.2 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 259 534 | 3.5 | 949 122 | 3.6 | 5 755 240 | 4.5 |
| Black or African American | 8 811 504 | 24.1 | 6 501 486 | 24.9 | 26 831 217 | 20.8 |
| White | 24 505 588 | 67.2 | 17 343 849 | 66.4 | 94 596 881 | 73.5 |
| Other | 1 806 086 | 4.9 | 1 259 726 | 4.8 | 0 | 0.0 |
| West region total | 11 770 617 | | 8 173 261 | | 78 743 364 | |
| Race, n, % | ||||||
| American Indian or Alaskan Native | 65 199 | 0.6 | 45 439 | 0.6 | 2 256 781 | 2.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 291 153 | 11.0 | 939 960 | 11.5 | 10 529 884 | 13.4 |
| Black or African American | 846 952 | 7.2 | 592 506 | 7.2 | 4 897 515 | 6.2 |
| White | 7 664 648 | 65.1 | 5 335 411 | 65.3 | 61 059 184 | 77.5 |
| Other | 1 902 665 | 16.2 | 1 259 945 | 15.4 | 0 | 0.0 |
a. These values may not add up to 100% for each category due to rounding.
The percent of TriNetX Dataworks‐USA patients with an Unknown race is approximately 22% overall and is highest in the West (27%) and lowest in the South (18%) (Tables 1 and 2). This is similar to the percent missing race information in other large healthcare databases [23], and when looking only among patients with a clinical encounter in the past 5 years, the percentage of patients who identified as Unknown has decreased among all regions and overall. In March of 2024, the Federal Interagency Technical Working Group on Race and Ethnicity Standards (Working Group) recommended to the Office of Management and Budget that race and ethnicity be captured in a single question allowing for multiple responses; as this change is implemented, further changes to the completeness of the race and ethnicity questions in the EHR may occur [22].
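The regional figures quoted above follow from the counts in Table 2. The sketch below is illustrative only: the helper name `unknown_share` is hypothetical, and the counts are copied from the "All patients" column of Table 2.

```python
# Counts from Table 2, all TriNetX Dataworks-USA patients.
region_totals = {
    "Northeast": 32_703_609,
    "Midwest": 17_611_847,
    "South": 44_312_379,
    "West": 16_113_086,
}
unknown_race = {
    "Northeast": 8_591_799,
    "Midwest": 3_391_241,
    "South": 7_823_856,
    "West": 4_342_469,
}


def unknown_share(unknown: dict, totals: dict) -> dict:
    """Percent of patients with Unknown race per region, to one decimal place."""
    return {region: round(100 * unknown[region] / totals[region], 1)
            for region in totals}


shares = unknown_share(unknown_race, region_totals)
highest = max(shares, key=shares.get)
lowest = min(shares, key=shares.get)
print(highest, shares[highest])  # West 26.9
print(lowest, shares[lowest])    # South 17.7
```

Rounded to whole percentages, these values correspond to the 27% (West) and 18% (South) reported in the text.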
Although the TriNetX Dataworks‐USA Network represents a robust collection of EHR data from over 60 HCOs across the United States, there are some limitations, as with all data sources. First, due to privacy considerations, the geographic location assigned to the patient is that of the HCO where they received medical care; thus, it is not feasible to conduct analyses using the geographic location specific to the patient's home address. HCO location can be used as a proxy, but given the structure of the healthcare system in the United States, patients do not always receive care where they live. Another limitation is that race and ethnicity are missing for over one quarter of patients in the Dataworks‐USA Network; however, this is similar to other large, nationally representative healthcare data sources [24]. Another privacy‐related constraint is that some dates, such as date of death, are obfuscated to reduce re‐identification risk; this may complicate some analyses, particularly time‐to‐event analyses with death as the outcome. The TriNetX Dataworks‐USA Network includes month and year of death, but day is omitted. Death data may also be incomplete; thus, some patients may be misclassified as living when they are deceased. Additionally, these data do not currently have cause of death available, though a subset of these patients can be linked via tokenization to additional sources of death data, providing a more robust view of deaths that may occur outside of a healthcare system. However, many real‐world data sources have reported similar challenges [8, 10, 11, 12, 24, 25]. Finally, the Dataworks‐USA Network is captured from EHR data; the limitations of EHR‐only data in the U.S. are well‐established [24]. Patients may present to multiple institutions, resulting in fragmented records. EHR data by definition reflects a healthcare‐seeking population.
Despite the fact that the TriNetX Dataworks‐USA Network is comprised of EHR data collected as a part of routine patient care, whereas the US Census publishes estimates based on rigorously collected survey data, the TriNetX Dataworks‐USA Network appears to align relatively well with the overall demographics of the general US population. Observed differences are largely consistent with prior literature on demographic differences expected in healthcare‐seeking populations relative to the general population.
4.1. Plain Language Summary
TriNetX's Dataworks‐USA Network provides a robust data source for researchers looking to examine health trends within the United States and is broadly generalizable to the US population with the limitations to be expected from real‐world data originating in an inpatient and outpatient healthcare setting.
Conflicts of Interest
The authors of this article are all employees of TriNetX, the vendor of the main data source under study in this article.
Stein E., Hüser M., Amirian E. S., Palchuk M. B., and Brown J. S., “TriNetX Dataworks‐USA: Overview of a Multi‐Purpose, De‐Identified, Federated Electronic Health Record Real‐World Data and Analytics Network and Comparison to the US Census,” Pharmacoepidemiology and Drug Safety 34, no. 9 (2025): e70198, 10.1002/pds.70198.
Funding: This work was supported by TriNetX.
References
- 1. Charvériat M., Darmoni S. J., Lafon V., et al., “Use of Real‐World Evidence in Translational Pharmacology Research,” Fundamental and Clinical Pharmacology 36, no. 2 (2022): 230–236, 10.1111/fcp.12734.
- 2. Prilla S., Groeneveld S., Pacurariu A., et al., “Real‐World Evidence to Support EU Regulatory Decision Making‐Results From a Pilot of Regulatory Use Cases,” Clinical Pharmacology and Therapeutics 116 (2024): 1188–1197, 10.1002/cpt.3355.
- 3. “Framework for FDA's Real‐World Evidence Program,” (2018), https://www.fda.gov/media/120060/download.
- 4. “Regulatory Science to 2025,” (2020), https://www.ema.europa.eu/en/about‐us/how‐we‐work/regulatory‐science‐strategy.
- 5. Alipour‐Haris G. L., Acha V., Winterstein A. G., and Burcu M., “Real‐World Evidence to Support Regulatory Submissions: A Landscape Review and Assessment of Use Cases,” Clinical and Translational Science 17, no. 8 (2024): e13903.
- 6. Hunger M., Bardenheuer K., Passey A., Schade R., Sharma R., and Hague C., “The Value of Federated Data Networks in Oncology: What Research Questions Do They Answer? Outcomes From a Systematic Literature Review,” Value in Health 25, no. 5 (2022): 855–868.
- 7. Palchuk M. B., London J. W., Perez‐Rey D., et al., “A Global Federated Real‐World Data and Analytics Platform for Research,” JAMIA Open 6, no. 2 (2023): ooad035, 10.1093/jamiaopen/ooad035.
- 8. Topaloglu U. and Palchuk M. B., “Using a Federated Network of Real‐World Data to Optimize Clinical Trials Operations,” JCO Clinical Cancer Informatics 2 (2018): 1–10, 10.1200/cci.17.00067.
- 9. Suissa S., Henry D., Caetano P., et al., “CNODES: The Canadian Network for Observational Drug Effect Studies,” Open Medicine 6, no. 4 (2012): e134–e140.
- 10. Curtis L. H., Weiner M. G., Boudreau D. M., et al., “Design Considerations, Architecture, and Use of the Mini‐Sentinel Distributed Data System,” Pharmacoepidemiology and Drug Safety 21, no. Suppl 1 (2012): 23–31, 10.1002/pds.2336.
- 11. Curtis L. H., Brown J., and Platt R., “Four Health Data Networks Illustrate the Potential for a Shared National Multipurpose Big‐Data Network,” Health Affairs 33, no. 7 (2014): 1178–1186, 10.1377/hlthaff.2014.0121.
- 12. Haendel M. A., Chute C. G., Bennett T. D., et al., “The National COVID Cohort Collaborative (N3C): Rationale, Design, Infrastructure, and Deployment,” Journal of the American Medical Informatics Association 28, no. 3 (2021): 427–443, 10.1093/jamia/ocaa196.
- 13. Kahn M. G., Callahan T. J., Barnard J., et al., “A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data,” EGEMS 4, no. 1 (2016): 1244, 10.13063/2327-9214.1244.
- 14. “International Database: World Population Estimates and Projections,” (2022), https://www.census.gov/programs‐surveys/international‐programs/about/idb.html.
- 15. “Surveillance, Epidemiology, and End Results (SEER) Program Populations (1969‐2022),” www.seer.cancer.gov/popdata.
- 16. Levine D. M., Linder J. A., and Landon B. E., “Characteristics of Americans With Primary Care and Changes Over Time, 2002‐2015,” JAMA Internal Medicine 180, no. 3 (2020): 463–466, 10.1001/jamainternmed.2019.6282.
- 17. Keisler‐Starkey K. and Bunch L. N., Health Insurance Coverage in the United States: 2021, Current Population Reports (U.S. Government Publishing Office, 2022).
- 18. “Actuarial Life Table,” https://www.ssa.gov/oact/STATS/table4c6.html.
- 19. Hing E. and Albert M., “State Variation in Preventive Care Visits, by Patient Characteristics,” NCHS Data Brief 234 (2016): 1–8.
- 20. Long M., Frederiksen B., Ranji U., and Salganicoff A., “Women's Health Care Utilization and Costs: Findings From the 2020 KFF Women's Health Survey,” (2021), https://www.kff.org/womens‐health‐policy/issue‐brief/womens‐health‐care‐utilization‐and‐costs‐findings‐from‐the‐2020‐kff‐womens‐health‐survey/#:~:text=For%20example%2C%20nine%20in%20ten,%25%20men%20ages%2050%2D64.
- 21. Yemane L., Mateo C. M., and Desai A. N., “Race and Ethnicity Data in Electronic Health Records‐Striving for Clarity,” JAMA Network Open 7, no. 3 (2024): e240522.
- 22. “Revisions to OMB's Statistical Policy Directive No. 15: Standards for Maintaining, Collecting, and Presenting Federal Data on Race and Ethnicity,” (2024), https://www.federalregister.gov/documents/2024/03/29/2024‐06469/revisions‐to‐ombs‐statistical‐policy‐directive‐no‐15‐standards‐for‐maintaining‐collecting‐and.
- 23. Polubriaginof F. C. G., Ryan P., Salmasian H., et al., “Challenges With Quality of Race and Ethnicity Data in Observational Databases,” Journal of the American Medical Informatics Association 26, no. 8–9 (2019): 730–736, 10.1093/jamia/ocz113.
- 24. Kohane I. S., Aronow B. J., Avillach P., et al., “What Every Reader Should Know About Studies Using Electronic Health Record Data but May be Afraid to Ask,” Journal of Medical Internet Research 23, no. 3 (2021): e22219, 10.2196/22219.
- 25. Qualls L. G., Phillips T. A., Hammill B. G., et al., “Evaluating Foundational Data Quality in the National Patient‐Centered Clinical Research Network (PCORnet),” eGEMs (Generating Evidence and Methods to Improve Patient Outcomes) 6, no. 1 (2018): 3, 10.5334/egems.199.
