Journal of the American Medical Informatics Association: JAMIA. 2024 Dec 23;32(3):579–585. doi: 10.1093/jamia/ocae241

Descriptive epidemiology demonstrating the All of Us database as a versatile resource for the rare and undiagnosed disease community

Drenen J Magee, Sierra Kicker, Aeisha Thomas
PMCID: PMC11833481  PMID: 39715481

Abstract

Objective

We aim to demonstrate the versatility of the All of Us database as an important source of rare and undiagnosed disease (RUD) data, because of its large size and range of data types.

Materials and Methods

We searched the public data browser, electronic health record (EHR), and several surveys to investigate the prevalence, mental health, healthcare access, and other data of select RUDs.

Results

Several RUDs have participants in All of Us [eg, 75 of 100 rare infectious diseases (RIDs)]. We generated health-related data for undiagnosed disease (UD), sickle cell disease (SCD), and cystic fibrosis (CF) groups, as well as for infectious (2 diseases) and chronic (4 diseases) disease pools.

Conclusion

Our results highlight the potential value of All of Us with both data breadth and depth to help identify possible solutions for shared and disease-specific biomedical and other problems such as healthcare access, thus enhancing diagnosis, treatment, prevention, and support for the RUD community.

Keywords: rare disease, rare and undiagnosed diseases, healthcare access, mental health, newborn screening

Introduction

To support individuals with rare and undiagnosed diseases (RUDs), a range of approaches are used, such as policies,1 registries (eg,2), networks (advocacy groups, research networks, eg,3,4), expert medical centers,5,6 rare disease (RD) databases (eg,7), Artificial Intelligence-directed data collation,8 integrative medicine,9 and study design algorithms.10 Despite these efforts, RUDs continue to present unique challenges, such as long or indefinite diagnostic odysseys, limited or absent treatments, expensive healthcare, and mental health and other concerns (11, Reviewed in Chung et al12). Patients, clinicians, researchers, policymakers, RUD advocacy, and other support groups are similarly tackling the basic issue that solutions are harder to find because these diseases are rare. Data is a critical factor that can help advance solutions across all these stakeholders (eg,13–16), and we propose that the All of Us Research Program database is uniquely poised to complement current RUD research due to its large size and range of data.

The goal of All of Us is to collect biomedical, behavioral, genomic, and other types of data from 1 million individuals, especially from underrepresented groups.17 Zeng et al,18 in their sweeping survey of prevalence measures across all types of diseases, found that both common and RDs are present in All of Us and that some RDs are enriched in All of Us. Similar large-scale efforts such as the UK Biobank19 and the RD and cancer-focused 100,000 Genomes Project20,21 have yielded progress. All of Us-based RD publications (eg,22) further support the idea that All of Us data is a potential substantial resource for RUDs.

This descriptive epidemiology study aims to build on these studies by highlighting how All of Us data could be used by various stakeholders in the RUD community. We determine the presence of select RDs in the publicly available All of Us data browser (PDB,23) and controlled-access data—electronic health records (EHRs) and personal and family health history survey (PFHHS)24 to show the different ways that RUD stakeholders (eg, family vs clinician/researcher) can explore All of Us. The extent to which RUDs, as identified by the All of Us program,25 are represented in the database was determined by finding lifetime prevalence, that is, the presence of a disease at any point during the course of life. There are estimated to be 6000+ RDs,26 and to demonstrate All of Us RUD data, we focused on the prevalence of 3 different RUD subcategories: (a) rare infectious diseases (RIDs), (b) newborn screening diseases, and (c) undiagnosed diseases (UD). Further, we identify several All of Us RUD-relevant data types (eg, employment, mental health, healthcare access, social parameters, etc,27 reviewed in Chung et al12 and Khan et al28) to demonstrate how the data can be used to develop solutions for multiple RDs, fulfilling a translational medicine principle that can be used with RDs.29–31 We hope that this study’s delineation of the many versatile ways in which All of Us data can be helpful will prompt future use by various stakeholders within the RUD community.

Methods

Extended methods are available in the Supplementary Materials. RDs are primarily defined by the Genetic and Rare Disease (GARD) February 2024 List32 with possible additions. Please note that some diseases may be missing due to differences in categorization or naming, and some discretion was used in our classification. Time of diagnosis was not factored into our calculations, so all prevalence data represent lifetime prevalence. This cross-sectional study uses data from the All of Us Research Program’s Controlled Tier Access v7 Dataset (CD), available to authorized users on the Researcher Workbench. The PDB data are from the 2/15/2023 release.23
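As a minimal sketch of the lifetime prevalence calculation described above, the following Python snippet divides the number of participants who ever had a condition recorded by the cohort size. The function name and the rounding to 6 decimal places are assumptions made for illustration; the counts and the denominator are the PDB values reported in Table 1.

```python
def lifetime_prevalence(n_cases: int, n_cohort: int, digits: int = 6) -> float:
    """Proportion of a cohort with a condition recorded at any point in life."""
    if n_cohort <= 0:
        raise ValueError("cohort size must be positive")
    if not 0 <= n_cases <= n_cohort:
        raise ValueError("case count must lie between 0 and the cohort size")
    return round(n_cases / n_cohort, digits)


# Counts and denominator as reported for the PDB EHR conditions in Table 1.
PDB_COHORT = 254_700
sickle_cell = lifetime_prevalence(600, PDB_COHORT)
lupus = lifetime_prevalence(3820, PDB_COHORT)
print(sickle_cell, lupus)
```

Because time of diagnosis is ignored, a participant counts as a case regardless of when the condition was recorded.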

Prevalence of RD present in the PFHHS

RDs in All of Us PFHHS were identified with the exception of “Cancer Conditions.” Using CD, cohorts were built based on these inclusion criteria: completion of the PFHHS and diagnosis of a specific RUD found in EHR or PFHHS data.
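The inclusion criteria above amount to a simple boolean rule: survey completion AND a diagnosis in either data source. The sketch below expresses that rule in Python. The study itself built cohorts in the Researcher Workbench, so the `Participant` class, its field names, and the records are hypothetical and serve only to illustrate the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    # Hypothetical record layout; not the All of Us data model.
    person_id: int
    completed_pfhhs: bool
    ehr_conditions: frozenset
    pfhhs_conditions: frozenset


def in_cohort(p: Participant, condition: str) -> bool:
    """True if the PFHHS was completed AND the condition is in EHR or PFHHS data."""
    has_diagnosis = condition in p.ehr_conditions or condition in p.pfhhs_conditions
    return p.completed_pfhhs and has_diagnosis


# Entirely made-up records for illustration; not All of Us data.
participants = [
    Participant(1, True, frozenset({"sickle cell disease"}), frozenset()),
    Participant(2, True, frozenset(), frozenset({"sickle cell disease"})),
    Participant(3, False, frozenset({"sickle cell disease"}), frozenset()),
    Participant(4, True, frozenset({"lyme disease"}), frozenset()),
]

scd_cohort = [p.person_id for p in participants if in_cohort(p, "sickle cell disease")]
print(scd_cohort)
```

Participant 3 is excluded despite an EHR diagnosis because the survey was not completed, which is why survey-based denominators differ from EHR-based ones.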

RID prevalence

The “Conditions” category of the All of Us public data browser was used to determine the number of individuals with each of the 100 RIDs as defined by GARD.32 This is different from the Genetic and Rare Disease February 2024 List used for other parts of this study. Some RIDs may have been missed due to differences in naming and classification and some discretion was used.
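Table 1 summarizes the RIDs by participant-count band (0, <=20, 20-200, >200). A rough sketch of that tally in Python follows. The handling of the band edges is an assumption, the function name is illustrative, and only the six counts above 200 come from Table 1; the remaining entries are hypothetical placeholders.

```python
from collections import Counter


def participant_band(count: int) -> str:
    """Assign a participant count to one of the bands used in Table 1."""
    if count < 0:
        raise ValueError("count cannot be negative")
    if count == 0:
        return "0"
    if count <= 20:
        return "<=20"
    if count <= 200:  # assumed boundary: 21 through 200 inclusive
        return "20-200"
    return ">200"


counts = {
    # Counts reported in Table 1 for RIDs with more than 200 participants.
    "Actinomycosis": 340,
    "Aspergillosis": 440,
    "Bacterial endocarditis": 300,
    "Chronic Epstein-Barr virus": 220,
    "Coccidioidomycosis": 700,
    "Pneumocystosis": 360,
    # Hypothetical placeholders, not real data.
    "hypothetical RID A": 0,
    "hypothetical RID B": 20,
    "hypothetical RID C": 60,
}

summary = Counter(participant_band(c) for c in counts.values())
print(dict(summary))
```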

RUSP prevalence

Using CD, 23 cohorts were created based on conditions listed in the recommended uniform screening panel (RUSP),33 released in 2023. Hearing loss and generalized sickle cell disease (SCD) were excluded. Data were gathered using the EHR information; due to a lack of distinction in the database, some cohorts include multiple types of the same condition. For some of the diseases with cohort sizes greater than 20, the prevalence in the US population was obtained.
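Table 1 reports RUSP prevalence as a 1:N ratio, which makes it easier to compare with published US figures. The snippet below is one way to derive such a ratio from a cohort count. Rounding to the nearest integer is an assumption and may not reproduce every entry in Table 1; the counts and denominator are those reported there.

```python
def one_in_n(n_cases: int, n_cohort: int) -> str:
    """Express prevalence as '1:N', with N rounded to the nearest integer."""
    if n_cases <= 0:
        raise ValueError("at least one case is needed to form a ratio")
    return f"1:{round(n_cohort / n_cases)}"


RUSP_COHORT = 287_012  # denominator reported in Table 1

hypothyroidism = one_in_n(294, RUSP_COHORT)  # congenital hypothyroidism
sickle_cell_anemia = one_in_n(823, RUSP_COHORT)
print(hypothyroidism, sickle_cell_anemia)
```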

Undiagnosed prevalence and overlap

The UD population was defined using the CD EHR Condition domain data. To determine the overlap of this subgroup with other RD subgroups, cohorts were built to include participants with UD and an additional RD per the conditions domain of their EHR.
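Conceptually, the overlap cohorts are set intersections of participant identifiers. The following sketch shows the idea with Python sets. The identifiers are invented, and the actual cohorts were built in the Researcher Workbench rather than with this code.

```python
def overlap(cohort_a: set, cohort_b: set) -> set:
    """Participants who appear in both cohorts."""
    return cohort_a & cohort_b


# Invented participant identifiers for illustration only.
undiagnosed = {101, 102, 103, 104}
chronic_pool = {103, 104, 201, 202}
infectious_pool = {104, 301}

ud_and_chronic = overlap(undiagnosed, chronic_pool)
ud_and_infectious = overlap(undiagnosed, infectious_pool)
print(len(ud_and_chronic), len(ud_and_infectious))
```

A non-empty intersection is what indicates that some participants carry both an undiagnosed-disease record and a specific RD diagnosis.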

Multidimensional investigation of health and lifestyle topics related to RDs

Cohorts were designed within the Cohort Builder of the Researcher Workbench using CD. Multiple All of Us data sources were used to describe many different types of healthcare and lifestyle-related data. Specifically, 5 surveys and 2 domains of EHR data were used to construct a plethora of cohorts which were then used to mine these data. For specific cohort design parameters, see Table S3.

Results

All of Us has different data sources34 such as EHR and surveys, like the PFHHS, which asks about specific medical conditions, including at least 10 RDs, indicating All of Us’s inclusivity of the RD population. Aggregate data can be accessed by anyone on the internet through the PDB and individuals can become All of Us researchers for further access.35 Table 1 shows that the prevalence of select RDs is similar between the PDB, controlled PFHHS, controlled EHR data, and as reported by Zeng et al.18 The PDB flagged participants for 75 of the 100 RIDs assessed, while EHR data are present for several RUSP diseases (Table 1, Tables S1 and S2). Please note that, depending on categorization, some RDs overlapped subcategories (eg, SCD). Participants were identified with both UD and RD data (Figure 1).

Table 1.

Prevalence of RUDs in All of Us.

All of Us prevalence from different sources

| Rare disease | Prevalence: PDB EHR conditions (n = 254 700) | Prevalence: controlled access PFHHS (n = 185 232) | Prevalence: controlled access EHR conditions (n = 250 242) | Prevalence: All of Us per Zeng et al (ref 18) |
| --- | --- | --- | --- | --- |
| Sickle cell | 0.002356 (600) | 0.0016 (301) | 0.001007 (252) | 0.002 |
| Systemic lupus | 0.014998 (3820) | 0.0091 (1685) | 0.015217 (3808) | 0.013 |
| Dengue fever | 0.000157 (40) | 0.0040 (741) | 0.000152 (38) | 0.000146 |
| West Nile virus | 0.000236 (60) | 0.0011 (204) | 0.000180 (45) | N/A |
| Zika virus | <=20 | 0.0005 (94) | <=20 | 9.8e−05 |
| SARS | 0.000707 (180) | 0.0027 (499) | 0.000707 (177) | 0.060 |
| Lou Gehrig’s/Amyotrophic Lateral Sclerosis | 0.000707 (180) | 0.0003 (60) | 0.000663 (166) | N/A |
| Tuberculosis | 0.007774 (1980) | 0.0093 (1720) | 0.007880 (1972) | 0.004989 |
| Spinal cord injury | N/D | 0.0232 (4296) | 0.004304 (1077) | N/D |
| Lyme disease | 0.008324 (2120) | 0.0225 (4175) | 0.008468 (2119) | 0.007 |

RID (from PDB EHR)

| # Participants | # RIDs with participant data (100 RIDs total) |
| --- | --- |
| 0 | 25 |
| <=20 | 48 |
| 20-200 | 21 |
| >200 | 6 |

| RID condition >200 participants | Prevalence in PDB (n = 254 700) | Prevalence in All of Us per Zeng et al (ref 18) |
| --- | --- | --- |
| Actinomycosis | 0.0013 (340) | N/A |
| Aspergillosis | 0.0017 (440) | 0.002 |
| Bacterial endocarditis | 0.0012 (300) | N/D |
| Chronic Epstein-Barr virus | 0.0009 (220) | N/A |
| Coccidioidomycosis | 0.0027 (700) | N/A |
| Pneumocystosis | 0.0014 (360) | N/A |

Prevalence of RUSP (from controlled access EHR)

| # Participants | # Primary conditions (total = 35) | # Secondary conditions (total = 26) |
| --- | --- | --- |
| 0 | 14 | 20 |
| <=20 | 15 | 6 |
| >20 | 6 | 0 |

| RUSP condition >20 | All of Us prevalence (n = 287 012) | U.S. prevalence | Prevalence in All of Us per Zeng et al (ref 18) |
| --- | --- | --- | --- |
| Congenital adrenal hyperplasia | 1:4283 (67) | 1:1666 (ref 36) | N/A |
| Congenital hypothyroidism | 1:976 (294) | 1:14 285 (ref 36) | 0.001 |
| CF | 1:639 (449) | ∼40 000 (ref 37) | 0.002 |
| Homocystinuria | 1:1112 (258) | 1:10 000 (ref 38) | N/A |
| Sickle cell anemia | 1:349 (823) | ∼120 000 (ref 39) | 0.002 |
| Sickle cell β-thalassemia | 1:2080 (138) | 1:100 000 (ref 40) | 0.000 |

The All of Us prevalence from multiple All of Us sources and subcategories RID and RUSP.

Abbreviations: CF, cystic fibrosis; EHR, electronic health record; N/A, not present; N/D, not distinct; PFHHS, personal and family health history survey; RID, rare infectious disease; RUD, rare and undiagnosed disease; RUSP, recommended uniform screening panel.

Figure 1.

Venn diagram displaying a <=20 participant overlap of various RD subgroups—chronic pool (n = 3417), cystic fibrosis (n = 193) (CF), and infectious pool (n = 5737)—with the subcategory of participants defined as undiagnosed (n = 587).

Rare disease (RD) subcategories overlap with undiagnosed subgroups. “Chronic Pool” includes participants with systemic lupus, muscular dystrophy (MD), multiple sclerosis, or amyotrophic lateral sclerosis. “Infectious Pool” includes participants with tuberculosis or Lyme disease. “<=20” denotes a count of less than or equal to 20 participants. All RD subgroups were found to overlap with the undiagnosed disease (UD) population, meaning all subgroups were found to have some participants with a UD in addition to specific RDs.
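Small cells are shown as "<=20" rather than as exact counts. A minimal sketch of that masking step is given below. The threshold of 20 matches the label used in this article, while the function name, the choice to show zero as "0" (as Table 1 does), and the example counts are assumptions for illustration.

```python
SUPPRESSION_THRESHOLD = 20  # counts from 1 through 20 are masked


def display_count(count: int) -> str:
    """Return a reportable label: small non-zero counts become '<=20'."""
    if count < 0:
        raise ValueError("count cannot be negative")
    if 0 < count <= SUPPRESSION_THRESHOLD:
        return "<=20"
    return str(count)


# Hypothetical overlap counts; not values from the study.
overlaps = {"UD and chronic pool": 12, "UD and infectious pool": 20, "UD and CF": 3}
reportable = {name: display_count(n) for name, n in overlaps.items()}
print(reportable)
```

Applying the mask before any value leaves the analysis environment is the simplest way to keep exact small counts out of figures and tables.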

To show the versatility of All of Us data to serve the RUD population, we then identified RUD participants with different types of All of Us data (Table 2). While we recognize that all RUD stakeholders could find all the data valuable, we suggest possible primary stakeholders for our diverse findings. The UD population had lower unemployment than the survey reference population (7.6% vs 13.5%). Social satisfaction is an indicator of social quality of health,41 and a higher proportion of the UD population reported social satisfaction (65.5% vs 56.8%). Health insurance coverage data were available across all studied groups. Depression prevalence was similar to or higher than that of the reference population, and was highest in the family of muscular dystrophy (MD) subgroup (43.1%; Table 2). Participants have records showing the use of medications for SCD (hydroxyurea42) and CF (elexacaftor, tezacaftor, or ivacaftor43). For some RD subgroups, the expense of mental health care and prescription medicine was a limiting factor to treatment in the last 12 months, but at similar levels to the reference population. Insurance acceptance data were available for all of the studied groups as well, but some groups had counts of 20 or fewer. Finally, we describe levels of community support, which required the data to be coarsened to conform to All of Us data dissemination policy; even so, RUD subgroups appeared to have less support than the reference population.

Table 2.

Demonstrating the breadth and depth of All of Us.

| Primary All of Us data source | Parameter selected | Cohort | Results | Likely primary stakeholders |
| --- | --- | --- | --- | --- |
| Basics survey | Unemployment status | Surveys (n = 178 102) | 13.5% | Social workers, policymakers, advocacy groups |
| | | UD (n = 420) | 7.6% | |
| | Covered by insurance | Surveys (n = 178 102) | 94% | Social workers, policymakers, advocacy groups |
| | | UD (n = 420) | >85%* | |
| | | SCD (n = 182) | >85%* | |
| | | CF (n = 193) | >85%* | |
| | | Infectious (n = 5737) | 96% | |
| | | Chronic (n = 3417) | 96% | |
| Overall health survey | Social satisfaction (social indicators of health) | Surveys (n = 178 102) | 56.8% | Families, advocacy groups, social workers, policymakers |
| | | UD (n = 420) | 65.5% | |
| PFHH survey | Depression prevalence (mental health) | Surveys (n = 178 102) | 29.9% | Mental health professionals, social workers, policymakers |
| | | UD (n = 420) | 31.9% | |
| | | SCD (n = 182) | 33.0% | |
| | | CF (n = 193) | 33.2% | |
| PFHH survey | Depression prevalence (mental health) | PFHH survey (n = 185 232) | 29.4% | |
| | | Family of SCD (n = 702) | 32.1% | |
| | | Family of MD (n = 404) | 43.1% | |
| Drug domain of EHR data | Medication record | SCD (n = 182) | hydroxyurea 15.9% | Clinicians, social workers |
| | | CF (n = 193) | ivacaftor, tezacaftor, or elexacaftor 12.4% | |
| Healthcare Access and Utilization (HAU) survey | Mental healthcare too expensive in last 12 months | Surveys (n = 178 102) | 8% | Social workers, advocacy groups, mental health professionals, policymakers |
| | | UD (n = 420) | <=20 | |
| | | SCD (n = 182) | 12% | |
| | | CF (n = 193) | <=20 | |
| | | Infectious (n = 5737) | 8% | |
| | | Chronic (n = 3417) | 12% | |
| | Prescriptions too expensive in last 12 months | Surveys (n = 178 102) | 11% | Social workers, advocacy groups, clinicians, policymakers |
| | | UD (n = 420) | 8% | |
| | | SCD (n = 182) | 20% | |
| | | CF (n = 193) | 19% | |
| | | Infectious (n = 5737) | 11% | |
| | | Chronic (n = 3417) | 21% | |
| | Insurance acceptance problems | Surveys (n = 178 102) | 11% | Social workers, advocacy groups, clinicians, policymakers |
| | | UD (n = 420) | 5% | |
| | | SCD (n = 182) | 18% | |
| | | CF (n = 193) | <=20 | |
| | | Infectious (n = 5737) | 13% | |
| | | Chronic (n = 3417) | 16% | |
| Social determinants of health survey (SDH) | Community support with medical visits or meal preparation when needed | SDH survey (n = 117 783) | >99%* | Families, social workers |
| | | UD (n = 338) | >75%* | |
| | | SCD (n = 98) | >75%* | |
| | | CF (n = 107) | >75%* | |

RUD participant data from multiple All of Us data categories. “Surveys” under the cohort column denotes All of Us population that completed the PFHH, basics, and HAU surveys; “*” denotes data that were coarsened to comply with All of Us data dissemination policy; “Infectious Pool” includes individuals with Lyme disease or tuberculosis; “Chronic Pool” includes individuals with systemic lupus, MD, multiple sclerosis, or amyotrophic lateral sclerosis.

Abbreviations: CF, cystic fibrosis; EHR, electronic health record; MD, muscular dystrophy; PFHH survey, personal and family health history; RUD, rare and undiagnosed disease; SCD, sickle cell disease; UD, undiagnosed disease.

Discussion

Our descriptive epidemiology study shows that All of Us is a valid resource for RUD data since All of Us (a) contains data for many RUDs and (b) has individuals with multiple accompanying types of actionable RUD-relevant data. These data complement earlier All of Us RD work18 with comparable prevalence data. This database therefore reflects what would normally be collected by several registries, overcoming certain registry limitations44 since data are streamlined and participant protection is carefully monitored. Further, because these diseases are rare, it can be difficult to find others with the same disease45 and the PDB provides an opportunity for anyone (eg, patients) to easily know if there are individuals with specific RDs in All of Us. Indeed, it is unusual to have a single data source that includes so many different types of data for each participant, and even for relatives.

This study demonstrates the versatility of All of Us RUD data by identifying participants with data that could be actionable for multiple stakeholders. The employment, insurance, social satisfaction, mental health, medication records, healthcare access, and community support data from All of Us certainly have implications for many stakeholders (Table 2). We hope that this highlights that All of Us is like a combination of registries and that this prompts further characterization of All of Us to advance the understanding of the natural history of RUDs, the chronicling of diagnostic and treatment strategies, increased awareness, and direction to address financial and other challenges.44 There are also policy implications for All of Us RUD data. For example, states choose which of the nationally recommended RUSP newborn screening diseases to adopt,46 and so the presence of RUSP data in All of Us could be used to support state policy campaigns. Mental health, healthcare access, and other data for the RUDs assessed in All of Us show varied trends, some of which are consistent with the literature27,47–52 (reviewed in Chung et al12). This study adds to mental health and other research53,54 indicating that All of Us can be used for advocacy. Artificial Intelligence has been used to mine All of Us to identify new uses for drugs,55 and RD data in All of Us could be used similarly and add to RD-focused and broader efforts (eg,56). While we only addressed a few survey questions, there are many others that can highlight more nuances in future studies. The diagnostic odyssey is a problem for the RUD community12 and All of Us has several UD participants with data that can be explored. A closer investigation of the subset of UD participants with a comorbid RD diagnosis may yield insights into the diagnostic odyssey. All of Us genomic data have shown the presence of new variants57 and thus hold promise for solutions for the UD population.58

While these findings indicate that All of Us is indeed a compelling source of RUD data, we recognize that there are limitations. For example, All of Us currently only has data from adults, and so childhood RDs are not included59 (also noted by Zeng et al18), which may soon be resolved with the anticipated inclusion of children.60 Many other RUDs are absent from All of Us, including some investigated in this study that had no participants (Table 1, Tables S1 and S2). Additionally, much of the data gathered in this study were collected via survey, introducing potential self-reporting bias. Nevertheless, the existence of this rich source of RUD data holds much promise. For example, while our conclusions are limited due to the cross-sectional nature of this study, especially for the RID, mental health, and comorbidity of undiagnosed and RDs, All of Us does contain temporal data that could be explored in future studies. It has been highlighted that All of Us does not reflect the prevalence of the US population because some groups are over- or underrepresented,18,61–64 limiting generalizability. Our findings (Table 1) and those by Zeng et al18 suggest that the prevalence of some RDs in All of Us may be higher than in the broader population, and we posit that this is a major advantage of this resource, given the limited data on RUDs.
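To make the comparison with the broader population concrete, the sketch below computes a simple fold difference between two prevalences expressed as 1-in-N ratios. It uses the homocystinuria and sickle cell β-thalassemia figures from Table 1. The function name is illustrative, and the calculation is a back-of-the-envelope aid under the assumption that the two ratios are directly comparable; it was not part of the study's analysis.

```python
def fold_enrichment(aou_one_in: float, us_one_in: float) -> float:
    """Ratio of All of Us prevalence to US prevalence, both given as 1-in-N."""
    if aou_one_in <= 0 or us_one_in <= 0:
        raise ValueError("ratios must be positive")
    return (1 / aou_one_in) / (1 / us_one_in)


# 1-in-N figures as listed in Table 1.
homocystinuria = fold_enrichment(1112, 10_000)
beta_thalassemia = fold_enrichment(2080, 100_000)
print(round(homocystinuria, 1), round(beta_thalassemia, 1))
```

A value above 1 means the condition is more common among All of Us participants than in the cited US estimate, which is consistent with the over- and underrepresentation noted above and is not evidence of a true difference in disease burden.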

Finally, this study highlights some of the RUD descriptive epidemiology present in All of Us and hopefully will encourage others to conduct similar work and further pursue analytic epidemiology using this rich data source. RUD stakeholders can find data for undiagnosed and specific RDs, or data relevant to RDs collectively from All of Us to help advance the work to support this community.

Acknowledgments

We thank the All of Us Biomedical Research Scholars Program (Baylor College of Medicine All of Us Evenings with Genetics) and the CURing All Of Us Team for supporting AT. We also appreciate the assistance provided by the All of Us Evenings with Genetics Office Hours.

We gratefully acknowledge All of Us participants for their contributions, without whom this study would not be possible. We also thank the National Institutes of Health’s All of Us Research Program for making available the participant and cohort data examined in this study.

The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA #: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276.

Contributor Information

Drenen J Magee, Department of Biological and Health Sciences, Crown College, St Bonifacius, MN 55375, United States.

Sierra Kicker, Department of Biological and Health Sciences, Crown College, St Bonifacius, MN 55375, United States.

Aeisha Thomas, Department of Biological and Health Sciences, Crown College, St Bonifacius, MN 55375, United States.

Author contributions

Drenen J. Magee and Sierra Kicker created cohorts, analyzed data, and edited the manuscript. Aeisha Thomas supervised the project, did some data collation, wrote the original draft, and edited the manuscript.

Supplemental material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

Support for AT was from the All of Us Evenings with Genetics Research Program under award number EWG-23-CE-313. The program, in part, is funded by the NIH All of Us Research Program, 1 OT2 OD031932-01.

Conflicts of interest

The authors have no competing interests.

Data availability

This study uses data from the All of Us Research Program’s Controlled Tier version 7, available to authorized users on the Researcher Workbench. The data are available in the All of Us Workspace and can be shared with registered users on request.
