Pharmacoepidemiology and Drug Safety. 2025 Sep 7;34(9):e70198. doi: 10.1002/pds.70198

TriNetX Dataworks‐USA: Overview of a Multi‐Purpose, De‐Identified, Federated Electronic Health Record Real‐World Data and Analytics Network and Comparison to the US Census

Ellen Stein 1, Matthias Hüser 1, E Susan Amirian 1, Matvey B Palchuk 1,2, Jeffrey S Brown 1,3
PMCID: PMC12414656  PMID: 40915660

ABSTRACT

Introduction

Many clinical data networks focus on a single use‐case or disease. By contrast, the TriNetX Dataworks‐USA Network contains real‐world clinical information that can be applied to multiple research questions and use cases. The purpose of this study is to describe the Network's characteristics, as well as its generalizability to the US population, particularly the healthcare‐seeking population.

Methods

Using the Dataworks‐USA Network, a large, regularly updated data network containing de‐identified patient electronic health record (EHR) information from across the United States, basic demographics were summarized and compared to the US Census Bureau International Database (IDB) 2022 data and the National Cancer Institute's version of the Census Bureau's U.S. County Population Data for 2022 to examine the generalizability of the Network.

Results

Patients in the Dataworks‐USA Network are, on average, approximately 5 years older than the US population described by the Census, and the Network has a larger proportion of female patients. The Network has a lower proportion of patients identified as Asian and White race, and a higher proportion identified as Other, relative to the Census; other races are similar between the two data sources (< 1% difference). Regionally, Dataworks‐USA has a smaller proportion of patients in all known race categories compared with the Census due to the larger proportion of patients of Unknown or Other race.

Conclusions

TriNetX's Dataworks‐USA Network provides a robust data source for many use cases and is broadly generalizable to the US population, particularly the healthcare‐seeking population, with differences related to the underlying nature of the data sources.

Keywords: census, electronic health record, federated data network, generalizability, real‐world data


Summary.

  • TriNetX Dataworks‐USA Network is a large research network comprised of de‐identified EHR data collected from multiple health systems as a part of routine patient care. The data appear to align well with the overall demographics of the general US population.

  • These data can support multiple research questions and use cases, such as disease burden, treatment patterns, health outcomes, and comparative medical product safety and effectiveness.

  • Observed differences between the TriNetX Dataworks‐USA Network and the general US population are largely consistent with prior literature on demographic differences expected in healthcare seeking populations relative to the general population.

1. Introduction

Since the 21st Century Cures Act, use of real‐world data (RWD) to accelerate medical product development [1] and support regulatory decision‐making has become increasingly common [2, 3, 4]. While randomized clinical trials (RCTs) have long been considered the gold standard for regulatory decisions, real‐world evidence (RWE) studies have become increasingly accepted as supportive evidence when RCTs are not feasible or ethical [5]. Federated data networks, which enable analysis of patient data from multiple organizations while minimizing privacy risk, allow researchers to access larger, richer source populations that can help increase statistical power, yield more precise estimates, evaluate rare diseases, and deepen insights [6].

In 2014, TriNetX developed an RWD real‐time querying platform that leverages a federated data network of 65 healthcare organizations (HCOs) across the United States and 182 HCOs globally (both as of July 22, 2024). The TriNetX platform hosts a large collection of de‐identified data, key analytic capabilities, and opportunities for clinical trial site identification, engagement, and participation [7, 8]. TriNetX's platform is Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), and General Personal Data Protection Law (LGPD) compliant [7]. The platform lets clients with user access iteratively query up‐to‐date federated electronic health record (EHR) data across these HCOs in real time [7]. It is also possible to download instances of the de‐identified data to analyze offline, or to purchase custom datasets.

Large, multi‐site data networks previously described in the literature are often designed and governed for a single purpose, such as surveillance or the observation of drug effects, or for a single disease or disease area, such as cancer research or COVID‐19 [9, 10, 11, 12]. The TriNetX Dataworks‐USA Network was designed to address a broad set of research questions from clinical trial design optimization to burden of illness, treatment patterns, and outcome studies. The purpose of this study is to describe TriNetX's Dataworks‐USA Network characteristics, as well as its generalizability to the US population. Other TriNetX networks include global data, region‐specific data (e.g., Latin America, European Union), and the Linked network, which links tokenized U.S. EHR data to closed claims and death data.

2. Methods

2.1. Data Source

The Dataworks‐USA Network is composed of approximately 75% academic medical centers and 25% community hospitals, integrated delivery networks, specialty hospitals, and large specialty physician practices. The Network has detailed clinical information available for over 110 million patients with frequent updates and little to no data lag, as HCOs update their data every 2 weeks on average, though the exact frequency varies by HCO. The Network contains data on medical encounters in the inpatient and outpatient settings that include demographic information, diagnoses recorded, medications administered, prescriptions written, laboratory test results, vital signs, and procedures for each medical encounter and day of a hospital stay. The TriNetX Dataworks‐USA Network includes structured data from all HCOs, as well as unstructured clinical documents available from a subset of HCOs in the network. Geographically, HCOs are well‐distributed across the US. EHR data from the TriNetX Dataworks‐USA Network are generated from routine healthcare encounters within each participating HCO. Patients may receive all or a proportion of their care at the HCO in the TriNetX Dataworks‐USA Network. Healthcare encounters that occur outside the contributing HCO will not be observed.

Extensive data quality procedures are implemented on an ongoing basis. TriNetX data sourcing principles include: liberating all health data, preserving the original data and documenting provenance, harmonizing for interoperability, and actively monitoring quality. The data quality program follows the guidelines developed by Kahn et al. [13] and includes reviews of conformance, completeness, and plausibility; these categories are further separated into two evaluation contexts: validation and verification. All patient data in the TriNetX Network are harmonized to standard terminologies. Clinical facts from the EHR are represented by International Classification of Diseases, Ninth/Tenth Revision, Clinical Modification (ICD‐9/10‐CM) diagnosis codes; Current Procedural Terminology (CPT), Healthcare Common Procedure Coding System (HCPCS), and International Classification of Diseases, Tenth Revision, Procedure Coding System (ICD‐10‐PCS) procedure codes; RxNorm medication codes organized into Veterans Affairs Therapeutic Class System (VA Class) and Anatomic Therapeutic Chemical (ATC) hierarchies; Health Level Seven International Fast Healthcare Interoperability Resources Release 4 (HL7 FHIR Release 4) encounter type codes; and Logical Observation Identifiers Names and Codes (LOINC) for lab tests, among others.

2.2. Statistical Analysis

Data from U.S. sites in the TriNetX Dataworks‐USA Network were accessed on July 25, 2024, using SQL queries. Aggregate statistics from each site were retrieved and then further aggregated. Patients' sex, date of birth, race, and ethnicity were obtained directly from this query. If a death record existed in the EHR data, age was calculated as year of birth subtracted from year of death; otherwise, age was calculated as year of birth subtracted from the current year at the time of the analysis, i.e., 2024. Sex, race, and ethnicity all include an Unknown category. Unknown could be due to this information being documented as ‘unknown’ by the HCO, missing when the data was made available to TriNetX by the HCO, or lacking a mapping to the appropriate standard for other reasons. Other race classification is applied to those patients for whom more than one race is specified in the EHR data or for whom the race documented in the EHR is not one of the available options in the TriNetX Dataworks‐USA Network, which are consistent with the HL7 CDC Version 1 standards. This same analysis was then repeated only among patients who had at least one clinical encounter in the past five complete calendar years (2019–2023).
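The age rule described above can be illustrated with a minimal Python sketch. This is not the authors' SQL; the function name `age_in_years` and the example patients are hypothetical, and only the rule itself (year of death minus year of birth when a death record exists, otherwise 2024 minus year of birth) comes from the text.

```python
from typing import Optional

ANALYSIS_YEAR = 2024  # year of the analysis, as stated in the text


def age_in_years(birth_year: int, death_year: Optional[int] = None,
                 analysis_year: int = ANALYSIS_YEAR) -> int:
    """Age at death if a death record exists, otherwise age in the analysis year."""
    end_year = death_year if death_year is not None else analysis_year
    return end_year - birth_year


# Hypothetical patients: (birth_year, death_year or None)
patients = [(1950, 2020), (1990, None), (2019, None)]
ages = [age_in_years(birth, death) for birth, death in patients]
print(ages)  # [70, 34, 5]
```

Because only years are used, the result can differ from exact age by up to one year in either direction, which is adequate for the broad age categories reported here.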

US Census Bureau International Database (IDB) 2022 data was downloaded on February 8, 2024 [14]. Counts of US residents by sex, age, race, and ethnicity were obtained from the downloaded data. Additionally, the National Cancer Institute's version of the Census Bureau's U.S. County Population Data for 2022 was downloaded on June 11, 2024 [15]. From the latter dataset, counts of U.S. residents by county by race were used to calculate regional race estimates.
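As a rough illustration of how county‐level counts can be rolled up into regional race estimates, the sketch below sums counts by region and race and converts them to within‐region percentages. It assumes the standard US Census Bureau grouping of states into four regions (the article names the regions but does not list their member states), and the function name, input layout, and counts are hypothetical. It uses only the Python standard library rather than the Pandas workflow mentioned in the article.

```python
from collections import defaultdict

# Assumed: US Census Bureau four-region grouping of states (plus DC).
REGION_STATES = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE DC FL GA MD NC SC VA WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states.split()
}


def regional_race_percentages(county_rows):
    """Sum county counts by (region, race), then convert to within-region percentages."""
    counts = defaultdict(lambda: defaultdict(int))
    for state, race, n in county_rows:
        counts[STATE_TO_REGION[state]][race] += n
    result = {}
    for region, by_race in counts.items():
        total = sum(by_race.values())
        result[region] = {race: round(100 * n / total, 1) for race, n in by_race.items()}
    return result


# Hypothetical county-level rows: (state, race, resident count)
rows = [
    ("NY", "White", 600), ("NY", "Black or African American", 200),
    ("PA", "White", 150), ("PA", "Black or African American", 50),
    ("CA", "White", 300), ("CA", "Asian or Pacific Islander", 100),
]
percentages = regional_race_percentages(rows)
print(percentages["Northeast"])  # {'White': 75.0, 'Black or African American': 25.0}
```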

Visualizations of the HCO and Census statistics were generated using Python data analysis packages (Pandas, Matplotlib, Seaborn). Descriptive statistics were post‐processed using Python data analysis packages (Pandas, NumPy).

3. Results

The TriNetX Dataworks‐USA Network and general U.S. population characteristics are presented in Table 1. The TriNetX Dataworks‐USA Network contains EHR records of over 110 million patients, with most data from 2007 to present, though some patients may be counted more than once if they receive care at multiple institutions. The U.S. population numbered over 330 million people as of 2022. Over 72 million patients have had at least one encounter in the past five complete calendar years (2019–2023); these patients with a recent documented encounter have a relatively similar sex, age, race, ethnicity, and geographic distribution, though there were fewer patients with an unknown race and unknown ethnicity in this subset (Table 1).

TABLE 1.

Baseline characteristics among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census.

| Baseline characteristic | TNX Dataworks‐USA Network, all patients: N or mean | % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | % or SD a | 2022 US Census: N or mean | % or SD a |
| --- | --- | --- | --- | --- | --- | --- |
| Total N c | 110 731 104 | N/A | 72 938 765 | N/A | 330 708 020 | N/A |
| Demographic characteristics | | | | | | |
| Age: mean, SD | 43.9 | 23.5 | 43.2 | 23.7 | 39.1 | 22.9 |
| Age categories, n, % | | | | | | |
| 0–4 | 3 367 166 | 3.0 | 2 965 865 | 4.1 | 18 538 353 | 5.6 |
| 5–17 | 14 730 022 | 13.3 | 9 949 929 | 13.6 | 53 912 474 | 16.3 |
| 18–24 | 9 263 483 | 8.4 | 6 094 486 | 8.4 | 31 328 131 | 9.5 |
| 25–34 | 15 132 179 | 13.7 | 9 604 935 | 13.2 | 45 501 300 | 13.8 |
| 35–44 | 14 909 200 | 13.5 | 9 530 861 | 13.1 | 43 695 365 | 13.2 |
| 45–54 | 12 874 291 | 11.6 | 8 385 387 | 11.5 | 40 431 645 | 12.2 |
| 55–64 | 13 972 084 | 12.6 | 9 406 323 | 12.9 | 42 085 437 | 12.7 |
| 65+ | 26 482 679 | 23.9 | 17 000 979 | 23.3 | 55 215 315 | 16.7 |
| Sex, n, % | | | | | | |
| Female | 58 576 267 | 52.9 | 39 362 042 | 54.0 | 166 215 285 | 50.3 |
| Male | 51 728 456 | 46.7 | 33 272 283 | 45.6 | 164 492 735 | 49.7 |
| Unknown | 426 381 | 0.4 | 304 440 | 0.4 | 0 | 0.0 |
| Race, n, % | | | | | | |
| American Indian or Alaskan Native | 345 805 | 0.3 | 252 537 | 0.3 | 4 382 234 | 1.3 |
| Asian | 4 102 745 | 3.5 | 2 976 645 | 4.0 | 20 953 941 | 6.3 |
| Native Hawaiian or Other Pacific Islander | 446 487 | 0.4 | 289 811 | 0.4 | 878 808 | 0.3 |
| Asian or Native Hawaiian or Other Pacific Islander (4‐race category) b | 4 549 232 | 3.9 | 3 266 456 | 4.4 | 21 832 749 | 6.6 |
| Black or African American | 15 037 371 | 12.9 | 10 701 021 | 14.4 | 45 399 743 | 13.6 |
| White | 64 217 093 | 55.3 | 43 501 760 | 58.3 | 251 602 174 | 75.5 |
| Other | 6 729 316 | 5.8 | 4 423 058 | 5.9 | 10 070 657 | 3.0 |
| Unknown | 25 311 283 | 21.8 | 12 408 459 | 16.6 | 0 | 0.0 |
| Ethnicity, n, % | | | | | | |
| Hispanic | 11 672 333 | 10.0 | 7 909 567 | 10.6 | 63 655 229 | 19.1 |
| Non‐Hispanic | 61 349 480 | 52.8 | 45 320 050 | 60.8 | 269 632 328 | 80.9 |
| Unknown | 43 168 287 | 37.2 | 21 323 674 | 28.6 | 0 | 0.0 |
| Geographic region, n, % | | | | | | |
| Midwest | 17 611 847 | 15.9 | 10 677 503 | 15.0 | 68 787 595 | 20.6 |
| Northeast | 32 703 609 | 29.5 | 19 861 902 | 27.9 | 57 040 406 | 17.1 |
| South | 44 312 379 | 40.0 | 30 156 043 | 42.4 | 128 716 192 | 38.6 |
| West | 16 113 086 | 14.6 | 10 413 940 | 14.6 | 78 743 364 | 23.6 |

a These values may not add up to 100% for each category due to rounding.

b This is an additional category listed in the data from the Census Bureau. For TriNetX data, the corresponding individual categories were summed. This row should not be included when summing categories to 100%, as including it will cause the total to exceed 100%.

c Patients are unique within a single HCO; however, due to the de‐identified nature of the data, patients who receive care at more than one HCO in the Network will not have their records linked and thus may be counted > 1 time.

The mean age among TriNetX Network patients is approximately 5 years greater than that of the general US population (based on the US Census), 43.9 years (±23.5) versus 39.1 years (±22.9), with a higher proportion of patients 65 or more years old and a smaller proportion of 0–17 year olds (Table 1). Additionally, the TriNetX population has a slightly larger proportion of female patients as compared with the general US population (53% vs. 50%).

Among the entire TriNetX Dataworks‐USA Network population, the proportions of patients who are Black or African American, American Indian or Alaskan Native, and Native Hawaiian or Other Pacific Islander are similar to the proportions of the general US population who identify as each of those races (< 1% difference). The TriNetX Dataworks‐USA Network population has a lower proportion who identified as Asian and White race, and a higher proportion who identified as Other, than the general US population. For ethnicity, the TriNetX Dataworks‐USA Network population has a lower proportion of both Hispanic and Non‐Hispanic patients than the general US population because ethnicity is Unknown for 37.2% of the population.

However, focusing on the distribution of race and ethnicity among those who are not Unknown in the TriNetX population, the distribution more closely aligns with that of the Census overall. The following is the distribution of race among TriNetX patients whose race is documented (proportions recalculated to exclude unknown race and ethnicity, respectively): American Indian or Alaskan Native, 0.4%; Asian, 4.5%; Native Hawaiian or Other Pacific Islander, 0.5%; Black or African American, 16.5%; White, 70.7%; Other, 7.4%. For ethnicity: Hispanic, 16.0%; Non‐Hispanic, 84.0%.
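The recalculation above can be reproduced from the all‐patient counts in Table 1 by dropping the Unknown category and dividing each remaining count by the new total. The following Python sketch shows that arithmetic; the function name `known_only_percentages` is hypothetical, and the counts are copied from Table 1.

```python
def known_only_percentages(counts, unknown_label="Unknown"):
    """Percentages per category after removing the Unknown category from the denominator."""
    known = {k: v for k, v in counts.items() if k != unknown_label}
    total = sum(known.values())
    return {k: round(100 * v / total, 1) for k, v in known.items()}


# All-patient counts from Table 1
race_counts = {
    "American Indian or Alaskan Native": 345_805,
    "Asian": 4_102_745,
    "Native Hawaiian or Other Pacific Islander": 446_487,
    "Black or African American": 15_037_371,
    "White": 64_217_093,
    "Other": 6_729_316,
    "Unknown": 25_311_283,
}
ethnicity_counts = {
    "Hispanic": 11_672_333,
    "Non-Hispanic": 61_349_480,
    "Unknown": 43_168_287,
}

race_pct = known_only_percentages(race_counts)
ethnicity_pct = known_only_percentages(ethnicity_counts)
print(race_pct["White"], race_pct["Black or African American"])  # 70.7 16.5
print(ethnicity_pct["Hispanic"], ethnicity_pct["Non-Hispanic"])  # 16.0 84.0
```

The denominators implied by this calculation (90 878 817 for race and 73 021 813 for ethnicity) match the totals reported in Table 3.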

The TriNetX Dataworks‐USA Network patients are well‐distributed across the United States, though there is a higher proportion of patients from the Northeast and a lower proportion of patients from the West and Midwest as compared with the general US population (Table 1).

Across all regions, the TriNetX Dataworks‐USA Network has a smaller proportion of patients in each known race category as compared with the US Census, because a larger proportion of patients in the TriNetX data have an Unknown race or are identified as Other (Table 2). When race by region is examined only among those whose race is known, the race distribution for all regions more closely resembles that of the US Census (Table 3).

TABLE 2.

Regional race distribution among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census.

| Race category, n, % | TNX Dataworks‐USA Network, all patients: N or mean | % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | % or SD a | 2022 US Census: N or mean | % or SD a |
| --- | --- | --- | --- | --- | --- | --- |
| Northeast region total | 32 703 609 | | 19 861 902 | | 57 040 406 | |
| American Indian or Alaskan Native | 70 026 | 0.2 | 46 843 | 0.2 | 493 206 | 0.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 243 702 | 3.8 | 825 013 | 4.2 | 4 489 260 | 7.9 |
| Black or African American | 2 990 362 | 9.1 | 2 001 982 | 10.1 | 8 297 391 | 14.5 |
| White | 17 878 611 | 54.7 | 11 482 499 | 57.8 | 43 760 549 | 76.7 |
| Other | 1 929 109 | 5.9 | 1 269 962 | 6.4 | 0 | 0.0 |
| Unknown | 8 591 799 | 26.3 | 4 235 603 | 21.3 | 0 | 0.0 |
| Midwest region total | 17 611 847 | | 10 677 503 | | 68 787 595 | |
| American Indian or Alaskan Native | 104 498 | 0.6 | 79 554 | 0.7 | 719 506 | 1.0 |
| Asian or Native Hawaiian or Other Pacific Islander | 453 408 | 2.6 | 330 150 | 3.1 | 2 811 439 | 4.1 |
| Black or African American | 1 869 899 | 10.6 | 1 269 277 | 11.9 | 8 197 362 | 11.9 |
| White | 10 701 345 | 60.8 | 7 215 938 | 67.6 | 57 059 288 | 82.9 |
| Other | 1 091 456 | 6.2 | 633 425 | 5.9 | 0 | 0.0 |
| Unknown | 3 391 241 | 19.3 | 1 149 159 | 10.8 | 0 | 0.0 |
| South region total | 44 312 379 | | 30 156 043 | | 128 716 192 | |
| American Indian or Alaskan Native | 105 811 | 0.2 | 80 517 | 0.3 | 1 532 854 | 1.2 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 259 534 | 2.8 | 949 122 | 3.1 | 5 755 240 | 4.5 |
| Black or African American | 8 811 504 | 19.9 | 6 501 486 | 21.6 | 26 831 217 | 20.8 |
| White | 24 505 588 | 55.3 | 17 343 849 | 57.5 | 94 596 881 | 73.5 |
| Other | 1 806 086 | 4.1 | 1 259 726 | 4.2 | 0 | 0.0 |
| Unknown | 7 823 856 | 17.7 | 4 021 343 | 13.3 | 0 | 0.0 |
| West region total | 16 113 086 | | 10 413 940 | | 78 743 364 | |
| American Indian or Alaskan Native | 65 199 | 0.4 | 45 439 | 0.4 | 2 256 781 | 2.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 291 153 | 8.0 | 939 960 | 9.0 | 10 529 884 | 13.4 |
| Black or African American | 846 952 | 5.3 | 592 506 | 5.7 | 4 897 515 | 6.2 |
| White | 7 664 648 | 47.6 | 5 335 411 | 51.2 | 61 059 184 | 77.5 |
| Other | 1 902 665 | 11.8 | 1 259 945 | 12.1 | 0 | 0.0 |
| Unknown | 4 342 469 | 26.9 | 2 240 679 | 21.5 | 0 | 0.0 |

a These values may not add up to 100% for each category due to rounding.

TABLE 3.

Baseline race and ethnicity distribution among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census, excluding the ‘unknown’ race and ethnicity categories.

| Baseline characteristic | TNX Dataworks‐USA Network, all patients: N or mean | % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | % or SD a | 2022 US Census: N or mean | % or SD a |
| --- | --- | --- | --- | --- | --- | --- |
| Race, n, % (no unknown category) | | | | | | |
| Total | 90 878 817 | | 62 144 832 | | 333 287 557 | |
| American Indian or Alaskan Native | 345 805 | 0.4 | 252 537 | 0.4 | 4 382 234 | 1.3 |
| Asian | 4 102 745 | 4.5 | 2 976 645 | 4.8 | 20 953 941 | 6.3 |
| Native Hawaiian or Other Pacific Islander | 446 487 | 0.5 | 289 811 | 0.5 | 878 808 | 0.3 |
| Black or African American | 15 037 371 | 16.5 | 10 701 021 | 17.2 | 45 399 743 | 13.6 |
| White | 64 217 093 | 70.7 | 43 501 760 | 70.0 | 251 602 174 | 75.5 |
| Other | 6 729 316 | 7.4 | 4 423 058 | 7.1 | 10 070 657 | 3.0 |
| Ethnicity, n, % (no unknown category) | | | | | | |
| Total | 73 021 813 | | 53 229 617 | | 333 287 557 | |
| Hispanic | 11 672 333 | 16.0 | 7 909 567 | 14.9 | 63 655 229 | 19.1 |
| Non‐Hispanic | 61 349 480 | 84.0 | 45 320 050 | 85.1 | 269 632 328 | 80.9 |

a These values may not add up to 100% for each category due to rounding.

Additionally, patients in the TriNetX Dataworks‐USA Network have, on average, over 75 diagnoses, approximately 150 medication records, and over 270 lab values and procedures available per patient. On average, patients in the TriNetX Dataworks‐USA Network have 6.2 years between their first available encounter and their most recent available encounter; however, the average length of follow‐up in EHR‐based studies varies in the literature and depends on each study's case definition.

4. Discussion

The TriNetX Dataworks‐USA Network combines clinically rich data from over 60 HCOs across the United States, and the basic demographics of the TriNetX population (i.e., sex, age, race, and ethnicity) closely align with those in the 2022 US Census data, with some caveats that are to be expected given that the Dataworks‐USA Network represents a healthcare‐seeking population. Generally, Americans who do not access primary care have been found to be younger, less medically complex, of minority background, and/or live in the South [16]. In the United States, older individuals generally tend to be more likely to access primary care than those who are younger [16, 17]. In line with general expectations for a healthcare‐seeking population, the TriNetX Dataworks‐USA Network population is on average approximately 5 years older than the US population (Table 1, Figure 1).

FIGURE 1.

Sex and age distribution of the TriNetX research network.

Furthermore, comparing age by sex between the TriNetX Dataworks‐USA Network and the general US population shows a larger proportion of women at older ages, which aligns with the general population trend of women having a greater life expectancy than men (Figures 1 and 2) [18]. Additionally, there is a dip among patients of both sexes between their early 50s and 60 years of age in the TriNetX Dataworks‐USA Network (Figure 1), as well as in the general population (Figure 2), which marks the end of the Baby Boomer generation, whose members are currently 60–80 years of age.

FIGURE 2.

Sex and age distribution of the 2022 US census.

The TriNetX Dataworks‐USA Network correspondingly has a larger proportion of women overall (Table 1), particularly among patients over 18 years of age (Figure 1), relative to the general US population (Figure 2). In the U.S., adult women tend to be more likely to access primary care [19, 20]. This difference is pronounced during the child‐bearing years, which is reflected in the TriNetX age by sex distribution (Figure 1) [19].

Because race and ethnicity are self‐reported and not mandatory fields in EHR documentation, these fields are unavailable for a substantial proportion of the TriNetX population (21.8% and 37.2%, respectively), while the US Census data does not have an “unknown” category (Table 1, Figures 3 and 4). Race or ethnicity may be Unknown in TriNetX's Dataworks‐USA Network for several reasons: the information was documented as ‘unknown’ by the HCO, the response was missing when the data was made available to TriNetX by the HCO, or the response lacked a mapping to the appropriate standard for other reasons. Race and ethnicity historically have been used to perpetuate fallacious stereotypes about biological differences between groups, which has led to mistrust and may lead to under‐reporting or discordant reporting between visits among individuals in minoritized communities [21]. Additionally, many Hispanic respondents may not identify with any of the OMB‐defined race categories [22]. These factors can all contribute to missing or discordant race and ethnicity data in EHRs. It is notable, however, that the percentage of patients with Unknown race in the TriNetX Dataworks‐USA Network is lower among patients with an encounter in the past 5 years than among all patients (17% vs. 22%, Table 1).

FIGURE 3.

Race distribution of the TriNetX research network and the 2022 US census.

FIGURE 4.

Ethnicity distribution of the TriNetX research network and the 2022 US census.

With regard to regional findings, the West and the Midwest geographic regions are relatively underrepresented in the TriNetX Dataworks‐USA Network, while the Northeast is overrepresented as compared to the general US population (Table 1). In spite of this, race distribution is similar within each geographic region between the TriNetX Dataworks‐USA Network and the general US population when looking at only patients in the TriNetX Dataworks‐USA Network whose race is not Unknown (Table 4). Differences in the distribution of race overall seem to be driven by the fact that the EHR allows for race to be listed as Other or Unknown, or to remain undocumented; these categories are not available in the US Census's surveys. The exception to this is the percent of the TriNetX Dataworks‐USA Network population in the South who identify as White, which is lower than the general US population; however, when looking only among patients with an encounter in the past 5 years, this difference is greatly reduced (Table 2). It appears that prior to this time period, there may have been a meaningful proportion of the TriNetX Dataworks‐USA Network patient population who were identified as Other, which has decreased in recent years (Table 2).

TABLE 4.

Regional baseline race and ethnicity distribution among all TriNetX Dataworks‐USA patients, only those with an encounter in the past 5 years, and the 2022 US Census, excluding the ‘unknown’ race and ethnicity categories.

| Race category, n, % (no unknown category) | TNX Dataworks‐USA Network, all patients: N or mean | % or SD a | TNX Dataworks‐USA Network, encounter in the past 5 years: N or mean | % or SD a | 2022 US Census: N or mean | % or SD a |
| --- | --- | --- | --- | --- | --- | --- |
| Northeast region total | 24 111 810 | | 15 626 299 | | 57 040 406 | |
| American Indian or Alaskan Native | 70 026 | 0.3 | 46 843 | 0.3 | 493 206 | 0.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 243 702 | 5.2 | 825 013 | 5.3 | 4 489 260 | 7.9 |
| Black or African American | 2 990 362 | 12.4 | 2 001 982 | 12.8 | 8 297 391 | 14.5 |
| White | 17 878 611 | 74.1 | 11 482 499 | 73.5 | 43 760 549 | 76.7 |
| Other | 1 929 109 | 8.0 | 1 269 962 | 8.1 | 0 | 0.0 |
| Midwest region total | 14 220 606 | | 9 528 344 | | 68 787 595 | |
| American Indian or Alaskan Native | 104 498 | 0.7 | 79 554 | 0.8 | 719 506 | 1.0 |
| Asian or Native Hawaiian or Other Pacific Islander | 453 408 | 3.2 | 330 150 | 3.5 | 2 811 439 | 4.1 |
| Black or African American | 1 869 899 | 13.1 | 1 269 277 | 13.3 | 8 197 362 | 11.9 |
| White | 10 701 345 | 75.3 | 7 215 938 | 75.7 | 57 059 288 | 82.9 |
| Other | 1 091 456 | 7.7 | 633 425 | 6.6 | 0 | 0.0 |
| South region total | 36 488 523 | | 26 134 700 | | 128 716 192 | |
| American Indian or Alaskan Native | 105 811 | 0.3 | 80 517 | 0.3 | 1 532 854 | 1.2 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 259 534 | 3.5 | 949 122 | 3.6 | 5 755 240 | 4.5 |
| Black or African American | 8 811 504 | 24.1 | 6 501 486 | 24.9 | 26 831 217 | 20.8 |
| White | 24 505 588 | 67.2 | 17 343 849 | 66.4 | 94 596 881 | 73.5 |
| Other | 1 806 086 | 4.9 | 1 259 726 | 4.8 | 0 | 0.0 |
| West region total | 11 770 617 | | 8 173 261 | | 78 743 364 | |
| American Indian or Alaskan Native | 65 199 | 0.6 | 45 439 | 0.6 | 2 256 781 | 2.9 |
| Asian or Native Hawaiian or Other Pacific Islander | 1 291 153 | 11.0 | 939 960 | 11.5 | 10 529 884 | 13.4 |
| Black or African American | 846 952 | 7.2 | 592 506 | 7.2 | 4 897 515 | 6.2 |
| White | 7 664 648 | 65.1 | 5 335 411 | 65.3 | 61 059 184 | 77.5 |
| Other | 1 902 665 | 16.2 | 1 259 945 | 15.4 | 0 | 0.0 |

a These values may not add up to 100% for each category due to rounding.

The percent of TriNetX Dataworks‐USA patients with an Unknown race is approximately 22% overall and is highest in the West (27%) and lowest in the South (18%) (Tables 1 and 2). This is similar to the percent missing race information in other large healthcare databases [23], and when looking only among patients with a clinical encounter in the past 5 years, the percentage of patients who identified as Unknown has decreased among all regions and overall. In March of 2024, the Federal Interagency Technical Working Group on Race and Ethnicity Standards (Working Group) recommended to the Office of Management and Budget that race and ethnicity be captured in a single question allowing for multiple responses; as this change is implemented, further changes to the completeness of the race and ethnicity questions in the EHR may occur [22].

Although the TriNetX Dataworks‐USA Network represents a robust collection of EHR data from over 60 HCOs across the United States, there are some limitations, as with all data sources. First, due to privacy considerations, the geographic location assigned to the patient is that of the HCO where they received medical care; thus, it is not feasible to conduct analyses using the geographic location specific to the patient's home address. HCO location can be used as a proxy, but given the structure of the healthcare system in the United States, patients do not always receive care where they live. Another limitation is that race and ethnicity are missing for over one quarter of patients in the Dataworks‐USA Network; however, this is similar to other large, nationally representative healthcare data sources [24]. Another privacy‐related constraint is that some dates, such as date of death, are obfuscated to reduce re‐identification risk; this may complicate some analyses, particularly time‐to‐event analyses with death as the outcome. The TriNetX Dataworks‐USA Network includes month and year of death, but day is omitted. Death data may also be incomplete; thus, some patients may be misclassified as living when they are deceased. Additionally, these data do not currently have cause of death available, though a subset of these patients can be linked via tokenization to additional sources of death data, providing a more robust view of deaths that may occur outside of a healthcare system. However, many real‐world data sources have reported similar challenges [8, 10, 11, 12, 24, 25]. Finally, the Dataworks‐USA Network is captured from EHR data; the limitations of EHR‐only data in the U.S. are well‐established [24]. Patients may present to multiple institutions, resulting in fragmented records. EHR data by definition reflects a healthcare‐seeking population.

Despite the fact that the TriNetX Dataworks‐USA Network is comprised of EHR data collected as a part of routine patient care, whereas the US Census publishes estimates based on rigorously collected survey data, the TriNetX Dataworks‐USA Network appears to align relatively well with the overall demographics of the general US population. Observed differences are largely consistent with prior literature on demographic differences expected in healthcare‐seeking populations relative to the general population.

4.1. Plain Language Summary

TriNetX's Dataworks‐USA Network provides a robust data source for researchers looking to examine health trends within the United States and is broadly generalizable to the US population with the limitations to be expected from real‐world data originating in an inpatient and outpatient healthcare setting.

Conflicts of Interest

The authors of this article are all employees of TriNetX, the vendor of the main data source under study in this article.

Stein E., Hüser M., Amirian E. S., Palchuk M. B., and Brown J. S., “TriNetX Dataworks‐USA: Overview of a Multi‐Purpose, De‐Identified, Federated Electronic Health Record Real‐World Data and Analytics Network and Comparison to the US Census,” Pharmacoepidemiology and Drug Safety 34, no. 9 (2025): e70198, 10.1002/pds.70198.

Funding: This work was supported by TriNetX.

Articles from Pharmacoepidemiology and Drug Safety are provided here courtesy of Wiley
