Author manuscript; available in PMC: 2023 Sep 27.
Published in final edited form as: Annu Rev Biomed Data Sci. 2023 Apr 27;6:153–171. doi: 10.1146/annurev-biodatasci-020722-020704

Table 1.

Examples of data inequality in biomedical datasets

| Dataset | Disease/phenotype | Data type | Ethnicity compositionᵃ | Reference(s) | URL |
|---|---|---|---|---|---|
| ADNI-3 | Alzheimer’s disease | Genotype and image | Caucasian 86%, other 14% | 118, 119 | https://adni.loni.usc.edu/ |
| GENIE | Cancers | Genomic variation | White 87%, Black or African American 6%, Asian 5%, other 2% | 120 | https://www.aacr.org/professionals/research/aacr-project-genie/ |
| GTEx (v8) | Gene expression in normal tissues | Genotype and transcriptome | White 85%, African American 13%, Asian 1%, unknown 1% | 121, 122 | https://gtexportal.org/ |
| GWAS | Various | Genotype and phenotype | European 88%, Asian 8%, African, African American or Afro-Caribbean 2%, Hispanic or Latin American 1%, other/mixed 1% | 16 | https://gwasdiversitymonitor.com/ |
| Million Veteran Program | Various | Genotype and electronic health record | European 70%, African 19%, admixed American 9%, Asian 2% | 4, 123 | https://www.mvp.va.gov/pwa/ |
| MIMIC-IV | Various | Electronic health record | White 77%, Black or African American 10%, Asian 3%, Hispanic/Latino 4%, other 6% | 62, 124, 125 | https://doi.org/10.13026/07hj-2a80 |
| SHHS | Cardiovascular diseases related to sleep-disordered breathing | Electronic health record | White 86%, Black 9%, other 5% | 126, 127 | https://sleepdata.org/datasets/shhs |
| TARGET | Pediatric cancers | Multiomics | White 80%, Black or African American 13%, Asian 5%, other 2% | NAᵇ | https://ocg.cancer.gov/programs/target |
| TCGA | Cancers | Multiomics | European ancestry 82%, African ancestry 6%, East Asian ancestry 6%, admixed ancestry 4%, other 2% | 33 | https://cancergenome.nih.gov/ |
| UK Biobank | Various | Genotype, genome sequence, and electronic health record | European ancestry 95%, African ancestry 2%, Central/South American ancestry 2%, East Asian ancestry 1% | 128 | https://www.ukbiobank.ac.uk/ |
ᵃ The original terms from the information sources are used. The percentages were calculated using the patients with known race/ethnicity/ancestry information (as of August 2022).

ᵇ Ethnicity composition numbers for TARGET were derived from the NCI Genomic Data Commons (https://portal.gdc.cancer.gov/).

Abbreviations: ADNI, The Alzheimer’s Disease Neuroimaging Initiative; GENIE, Genomics Evidence Neoplasia Information Exchange; GTEx, The Genotype-Tissue Expression Project; GWAS, genome-wide association studies; MIMIC-IV, Medical Information Mart for Intensive Care, version IV; NA, not any; NCI, National Cancer Institute; SHHS, Sleep Heart Health Study; TARGET, Therapeutically Applicable Research to Generate Effective Treatments; TCGA, The Cancer Genome Atlas.
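
Footnote a notes that the percentages were calculated over patients with known race/ethnicity/ancestry information. The following is a minimal sketch of that kind of calculation, not the authors' actual procedure: the function name `composition`, the set `UNKNOWN_LABELS`, and the toy input are all hypothetical, and each real dataset encodes missing values in its own way.

```python
from collections import Counter

# Hypothetical labels treated as "unknown"; real datasets use their own codes.
UNKNOWN_LABELS = {"", "unknown", "not reported"}


def composition(labels):
    """Return the percentage of each group among records with a known label."""
    known = [
        label.strip()
        for label in labels
        if label is not None and label.strip().lower() not in UNKNOWN_LABELS
    ]
    if not known:
        return {}
    counts = Counter(known)
    total = len(known)  # denominator excludes unknown/missing records
    return {group: 100.0 * n / total for group, n in counts.most_common()}


# Toy input for illustration only; not taken from any dataset in Table 1.
toy_labels = ["White", "White", "White", "Asian", "Unknown", None]
print(composition(toy_labels))  # {'White': 75.0, 'Asian': 25.0}
```

Excluding unknown records from the denominator means the reported shares sum to 100% over the known subset, so they describe the labeled patients rather than the full cohort.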