Human Genetics and Genomics Advances. 2023 Sep 14;4(4):100239. doi: 10.1016/j.xhgg.2023.100239

Evaluating genomic polygenic risk scores for childhood acute lymphoblastic leukemia in Latinos

Soyoung Jeon 1, Ying Chu Lo 1, Libby M Morimoto 2, Catherine Metayer 2, Xiaomei Ma 3, Joseph L Wiemels 1, Adam J de Smith 1, Charleston WK Chiang 1,4,5
PMCID: PMC10550840  PMID: 37710962

Summary

The utility of polygenic risk score (PRS) models has not been comprehensively evaluated for childhood acute lymphoblastic leukemia (ALL), the most common type of cancer in children. Previous PRS models for ALL were based on significant loci observed in genome-wide association studies (GWASs), even though genomic PRS models have been shown to improve prediction performance for a number of complex diseases. In the United States, Latino (LAT) children have the highest risk of ALL, but the transferability of PRS models to LAT children has not been studied. In this study, we constructed and evaluated genomic PRS models based on either non-Latino White (NLW) GWAS or a multi-ancestry GWAS. We found that the best PRS models performed similarly between held-out NLW and LAT samples (PseudoR2 = 0.086 ± 0.023 in NLW vs. 0.060 ± 0.020 in LAT), and can be improved for LAT if we performed GWAS in LAT-only (PseudoR2 = 0.116 ± 0.026) or multi-ancestry samples (PseudoR2 = 0.131 ± 0.025). However, the best genomic models currently do not have better prediction accuracy than a conventional model using all known ALL-associated loci in the literature (PseudoR2 = 0.166 ± 0.025), which includes loci from GWAS populations that we could not access to train genomic PRS models. Our results suggest that larger and more inclusive GWASs may be needed for genomic PRS to be useful for ALL. Moreover, the comparable performance between populations may suggest a more oligogenic architecture for ALL, where some large effect loci may be shared between populations. Future PRS models that move away from the infinite causal loci assumption may further improve PRS for ALL.

Keywords: acute lymphoblastic leukemia, latinos, polygenic risk scores, risk prediction


This study assessed and informed the approach to construct the most accurate polygenic risk prediction models of acute lymphoblastic leukemia (ALL) currently available for Latino individuals, who have the highest risk for ALL in the United States.

Introduction

Acute lymphoblastic leukemia (ALL) is the most common type of childhood cancer worldwide, representing 20% of all cancers in children in the United States.1 There are few established environmental risk factors for ALL, and genome-wide association studies (GWASs) have confirmed the contribution of genetic variation to ALL risk. To date, at least 19 loci have been discovered and replicated in previous GWASs, primarily performed with European ancestry individuals, suggesting the polygenic nature of susceptibility to ALL.2,3,4,5,6,7,8,9,10,11,12 Yet, how these variants collectively contribute to disease risk has not been fully characterized.

Polygenic risk scores (PRSs) can identify individuals at significantly elevated risk for a disease, such as cancer, by providing a quantitative measure of an individual’s inherited risk based on the cumulative impact of variants shown to be associated with the disease of interest. Moreover, there has been growing evidence that the predictive power of PRSs can be further increased by aggregation of genotypic effects across all variants even if they do not reach the commonly acknowledged genome-wide significance threshold for association (p = 5e-8).13,14 With ALL, this genomic PRS approach may enhance the efficacy of PRS models given the small number of known susceptibility loci.

However, one of the biggest limitations of PRSs is the lower predictive performance in non-European ancestry populations.15 Part of this loss in efficacy may be due to the over-representation of GWAS participants of European ancestry,15,16 resulting in much more informative GWASs for European ancestry individuals compared with those for other ancestries. The poor transferability may also arise from differences between populations in the patterns of linkage disequilibrium (LD) and in the number, magnitude of effect, and frequencies of the causal alleles.15,17,18 Such a limitation is particularly important for ALL, since Latino children have a higher and faster-increasing risk and poorer survival than non-Latino Whites (NLWs) in the United States.19,20,21,22,23 Currently available PRS models for ALL are based only on a limited number of known risk alleles. One of the first PRS models for ALL was constructed with 11 single nucleotide polymorphisms (SNPs) known to be associated with ALL as of 2018, with effect sizes estimated from a European ancestry cohort.24 Its efficacy in individual risk discrimination analysis may be over-estimated, and its transferability to non-European cohorts has not been evaluated. A subsequent PRS model reported in 2021, again using only SNPs from known associated loci from multi-ancestry GWASs, showed lower predictive performance than the earlier study, though it demonstrated similar performance between Latino and non-Latino White cohorts.2 No study has constructed genomic PRS models for ALL in any population to date.

In this study, we set out to construct and evaluate genomic PRS models derived using NLW cohorts and test their transferability to Latino (LAT) individuals. We evaluated two genomic PRS approaches—Pruning and Thresholding (P + T) and LDPred2—in parallel to PRS models constructed based on only genome-wide significant loci from the literature. We also aimed to examine whether effect sizes estimated from ethnic-specific GWASs or multi-ancestry meta-analysis, and whether training with matched ancestry LD reference panel, could improve the efficacy of the PRS.

Material and methods

Study cohort

The California Childhood Cancer Record Linkage Project (CCRLP) includes all children born in California during 1982–2009 and diagnosed with ALL at the age of 0–14 years per California Cancer Registry records from 1988 to 2011. Children who were born in California during the same period and not reported to California Cancer Registry as having any childhood cancer were considered potential controls. Detailed information on sample matching, preparation, and genotyping has been previously described.4 Because ALL is a rare childhood cancer, to increase statistical power of a genetic study we followed previous practice4 and incorporated additional controls using adult individuals from the Kaiser Resource for Genetic Epidemiology Research on Aging Cohort (GERA; dbGaP accession: phs000788.v1.p2). The GERA cohort was chosen because a very similar genotyping platform had been used.4 Both studies included data on self-reported race/ethnicity from birth certificate or upon cohort entry, which were used to perform stratified multi-ancestry GWASs.

The imputation and quality control (QC) of SNP array data were carried out in each study population, as previously described in a multi-ancestry meta-analysis GWAS of ALL.2 After QC filtering, the LAT GWAS included 1,878 cases and 8,410 controls, the NLW GWAS included 1,162 cases and 57,341 controls, the African American GWAS included 124 cases and 2,067 controls, and the East Asian GWAS included 318 cases and 5,017 controls.

Another GWAS was performed with individuals from the Children’s Oncology Group (COG; dbGaP accession: phs000638.v1.p1) as cases and from the Wellcome Trust Case-Control Consortium (WTCCC) as controls.25 We generally followed the same QC pipeline, but because self-reported race/ethnicity was not available to us, we performed global ancestry estimations using ADMIXTURE and the 1000 Genomes populations as reference. We removed individuals with <90% estimated European ancestry from the analysis, resulting in a total of 1,504 and 2,931 NLW cases and controls, respectively. This dataset was previously used as a replication cohort of European ancestry in our earlier study,2 but here we combined it with CCRLP NLW to increase the sample size of the discovery GWAS (below). We note that filtering based on genetically inferred ancestry for COG/WTCCC was due to a logistical constraint. We would expect that enriching for European ancestry through an arbitrary threshold may artificially increase genetic differentiation between the discovery GWAS (in NLW) and the targeted validation cohort (in LAT), thereby potentially overestimating any difference in performance for PRS models between populations.

The California Childhood Leukemia Study (CCLS),12,26 a non-overlapping California case-control study with controls selected from California birth records (1995–2008), was used as our validation dataset. In total, 306 NLW cases, 258 NLW controls, 592 LAT cases, and 509 LAT controls, based again on self-reported race/ethnicity at birth, were available for analysis. The QC procedures and imputation were performed in accordance with the discovery/training dataset.

This study was approved by institutional review boards at the California Health and Human Services Agency, University of Southern California, Yale University, University of California, Berkeley, and the University of California, San Francisco. The de-identified newborn dried blood spots for the CCRLP were obtained with a waiver of consent from the Committee for the Protection of Human Subjects of the State of California. The CPHS IRB Project number is 2018-118.

Overall study design

A PRS of an individual j is defined as a weighted sum of SNP allele counts:

$$\mathrm{PRS}_j = \sum_{i=1}^{m} \hat{\beta}_i \, g_{ij},$$

where $m$ is the number of SNPs to be included in the predictor, $\hat{\beta}_i$ is the per-allele weight for SNP $i$, and $g_{ij}$ is the allele count (0, 1, 2) or dosage of the allele of SNP $i$ in individual $j$.
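As a minimal illustration of the definition above, the weighted sum can be written in a few lines of Python. The function name, the weights, and the dosages below are hypothetical values chosen for illustration; they are not estimates from this study.

```python
from typing import Sequence


def polygenic_risk_score(weights: Sequence[float], dosages: Sequence[float]) -> float:
    """Weighted sum of allele counts (or dosages) for one individual."""
    if len(weights) != len(dosages):
        raise ValueError("weights and dosages must cover the same SNPs")
    return sum(b * g for b, g in zip(weights, dosages))


# Hypothetical per-allele weights for m = 3 SNPs (illustrative only).
beta_hat = [0.20, -0.10, 0.05]
# Hypothetical allele counts / imputed dosages for one individual j.
g_j = [2, 1, 0.5]

score = polygenic_risk_score(beta_hat, g_j)
```

In practice the same sum is evaluated for every individual over the SNP set selected by the PRS method, which is what distinguishes one PRS model from another.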

For each step of score derivation, optimization, and evaluation, we used three non-overlapping datasets to (1) perform discovery GWAS to estimate variant effect sizes, (2) optimize parameters for the best predictive score, and (3) evaluate the predictive performance of the resulting scores (Figure 1). Following the convention previously suggested,27,28 we refer to the datasets used in each of the three steps as “GWAS,” “testing,” and “validation” datasets.

Figure 1.

Summary of study design and analysis

The flowchart details different cohorts used for each step of PRS derivation with different discovery GWAS, optimization cohort, and evaluation in either non-Latino White or Latino populations. In PRS evaluation, comparison (1) focused on evaluating the transferability of PRS models optimized in non-Latino White cohort. Comparison (2) focused on different strategies for improving the PRS efficacy by optimizing in a Latino cohort (2a), using a Latino-only discovery GWAS (2b), using a multi-ethnic discovery GWAS and optimized in non-Latino White (2c) or Latinos (2d). NLW, non-Latino White; LAT, Latino American; CCRLP, California Childhood Cancer Record Linkage Project; GERA, Genetic Epidemiology Research on Aging Cohort; COG, Children’s Oncology Group; WTCCC, Wellcome Trust Case-Control Consortium; CCLS, California Childhood Leukemia Study.

We randomly selected and held-out 360 cases and 1,200 controls from each of CCRLP NLW (∼13.5% of the total cases and ∼2.0% of the total controls) and LAT (∼19.2% of the total cases and ∼14.3% of the total controls) as the testing datasets to identify the best PRS models, and used the remaining sample from CCRLP+GERA cohort as the GWAS dataset in the three different discovery GWASs: (1) NLW-only meta-analysis (combined with COG+WTCCC sample), (2) LAT-only GWAS, and (3) multi-ancestry meta-analysis. For each GWAS, we constructed PRS using two established approaches: Pruning and Thresholding (P + T) and LDPred2.

Discovery GWAS

We used PLINK (version 2.3 alpha) to test the association between imputed genotype dosage at each SNP and case-control status in logistic regression, after adjusting for the top 20 principal components (PCs) to control for potential confounding due to fine-scale structure and variation in genetic ancestry within each ethnic group. For NLW and multi-ancestry GWAS meta-analysis, the results from each study and/or racial/ethnic group were combined via the fixed-effect meta-analysis with variance weighting using METAL.29 For CCRLP/GERA, after excluding 360 cases and 1,200 controls each for NLW and LAT (as the held-out/testing samples), we included 802 cases and 56,141 controls in NLW, 1,518 cases and 7,210 controls in LAT, 318 cases and 5,017 controls in East Asian, and 124 cases and 2,067 controls in African American for discovery GWAS. For NLW meta-analysis, CCRLP/GERA GWAS was meta-analyzed with a separate GWAS conducted with 1,504 cases and 2,931 controls from a COG/WTCCC cohort, for a total sample size of 2,306 cases and 59,072 controls. Multi-ethnic meta-analysis was conducted with CCRLP/GERA NLW, LAT, East Asian (EAS), African American (AFR), and COG/WTCCC individuals, totaling 4,266 cases and 73,366 controls. While a single pooled multi-ancestry GWAS may be more powerful, in this study we opted for a multi-ancestry meta-analysis in part because the GERA subcohorts were genotyped on different versions of ancestry-specific Axiom arrays,30 necessitating QC processing stratified by self-reported race/ethnicity and the genotyping platform. The total sample size for each discovery GWAS design can be found in Table S1.
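The variance-weighted fixed-effect combination described above can be sketched as follows. This is a generic inverse-variance-weighted estimator written for illustration, not the METAL implementation; the function name and the input effect sizes and standard errors are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP.

    betas: per-study log-odds ratios; ses: their standard errors.
    Returns the pooled effect, its standard error, z score, and two-sided p value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return beta, se, z, p


# Hypothetical per-study estimates for a single SNP (illustrative only).
pooled_beta, pooled_se, pooled_z, pooled_p = fixed_effect_meta(
    betas=[0.30, 0.20], ses=[0.10, 0.10]
)
```

Studies with smaller standard errors receive larger weights, so larger cohorts dominate the pooled estimate.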

PRS derivation/optimization

For each ancestry-specific or multi-ancestry GWAS, we constructed the PRS using two different methods: Pruning and Thresholding (P + T) and LDPred2. Both methods used the GWAS summary statistics as the starting point, but each makes different choices for which SNPs to include in the predictor and the weight values assigned to each SNP.

Pruning and Thresholding (P + T) uses a p value threshold and LD-driven clumping procedure to construct scores. The scores using P + T approach were constructed using PLINK (version 1.9). In brief, given a user-defined threshold for associated p value and clumping parameters, the algorithm forms clumps around the index SNPs with all SNPs within a specified distance (kb) that have p value and pairwise LD (measured by r2) at levels greater than a specified threshold. The algorithm greedily and iteratively cycles through all index SNPs, beginning with the SNP with the most significant p value, only allowing each SNP to appear in one clump. The most significant SNPs for each LD-based clump across the genome are used to build the PRS with associated estimated effect sizes, βˆ, as weights. We constructed PRS using a range of p values (1.0, 0.5, 0.05, 5 × 10−4, 5 × 10−6, and 5 × 10−8), r2 (0.2, 0.4, 0.6, and 0.8), and kb (250, 500) thresholds for a total of 48 PRS models to optimize under this approach.
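The greedy clumping step and the 48-model grid described above can be sketched as below. This is a simplified illustration under stated assumptions, not PLINK's implementation: the SNP identifiers, positions, p values, and pairwise r2 values are hypothetical, and a single p value threshold is applied to index SNPs only.

```python
from itertools import product


def greedy_clump(snps, r2, p_max, r2_min, window_kb):
    """Return index SNPs after greedy LD clumping.

    snps: list of (snp_id, position_bp, p_value)
    r2:   dict mapping frozenset({id_a, id_b}) to pairwise r2 (missing = 0)
    """
    assigned = set()
    index_snps = []
    # Start from the most significant SNP and work down.
    for snp_id, pos, p in sorted(snps, key=lambda s: s[2]):
        if snp_id in assigned or p > p_max:
            continue
        index_snps.append(snp_id)
        assigned.add(snp_id)
        # Absorb nearby SNPs in LD with the index SNP; each SNP joins one clump only.
        for other_id, other_pos, _ in snps:
            if other_id in assigned:
                continue
            close = abs(other_pos - pos) <= window_kb * 1000
            linked = r2.get(frozenset((snp_id, other_id)), 0.0) >= r2_min
            if close and linked:
                assigned.add(other_id)
    return index_snps


# Hypothetical toy data (illustrative only).
toy_snps = [("rs_a", 100_000, 1e-9), ("rs_b", 150_000, 1e-6),
            ("rs_c", 900_000, 1e-3), ("rs_d", 120_000, 0.2)]
toy_r2 = {frozenset(("rs_a", "rs_b")): 0.9, frozenset(("rs_a", "rs_d")): 0.1}
kept = greedy_clump(toy_snps, toy_r2, p_max=0.05, r2_min=0.2, window_kb=250)

# Parameter grid from the text: 6 p value x 4 r2 x 2 window thresholds.
p_thresholds = [1.0, 0.5, 0.05, 5e-4, 5e-6, 5e-8]
r2_thresholds = [0.2, 0.4, 0.6, 0.8]
kb_windows = [250, 500]
pt_grid = list(product(p_thresholds, r2_thresholds, kb_windows))
```

Each combination in `pt_grid` yields one candidate P + T model, whose retained index SNPs are weighted by their GWAS effect sizes.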

LDPred2 uses a Bayesian approach to calculate the posterior mean effect size for each variant given a prior and subsequent shrinkage based on the extent to which the variant is correlated with similarly associated variants.31,32 The underlying Gaussian distribution additionally considers the proportion of causal variants (ρ). LDPred2 uses a grid of values for three hyper-parameters, ρ, h2 (the SNP heritability), and sparsity (whether to fit some variant effects to exactly zero), to construct PRS. We used ρ from a sequence of 17 values from 10−4 to 1 on a log-scale, a range of h2 within (0.7, 1, 1.4) × estimated heritability, and a binary sparsity option of either on or off (LDPred2-grid models), giving 17 × 3 × 2 = 102 grid models. In addition, we tested a model assuming infinitesimal causal effects, where each variant is assumed to contribute to disease risk (LDPred2-inf model). In total, we evaluated 103 PRS models using LDPred2.
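The size of the LDPred2 search space follows directly from the grid described above. The sketch below only enumerates hyper-parameter combinations and does not run LDPred2. The heritability estimate `h2_est` is a hypothetical placeholder, and even log10 spacing of ρ is an assumption consistent with "on a log-scale".

```python
from itertools import product

n_rho = 17
# 17 values evenly spaced on a log10 scale from 1e-4 to 1 (assumed spacing).
rho_grid = [10 ** (-4 + 4 * k / (n_rho - 1)) for k in range(n_rho)]

h2_est = 0.2  # hypothetical SNP-heritability estimate (placeholder value)
h2_grid = [mult * h2_est for mult in (0.7, 1.0, 1.4)]

sparse_options = (False, True)

# Every (rho, h2, sparse) combination is one LDPred2-grid model.
grid_models = list(product(rho_grid, h2_grid, sparse_options))

# Add the single infinitesimal (LDPred2-inf) model.
n_models = len(grid_models) + 1
```

Under this spacing the grid contains ρ values of 0.001, approximately 0.0032, and 0.01, which correspond to the values reported for the best-performing models in the Results.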

Once the variants and weights for each PRS model were estimated, the scores were generated in the testing sample (360 cases and 1,200 controls in NLW or LAT) using PLINK (version 2.3 alpha), and then standardized to have a mean of 0 and variance of 1. For each strategy, the score with the best predictive performance was the one with the highest Nagelkerke’s pseudo R2 (the proportion of variance explained), which was calculated as the difference between the R2 of a full model including the PRS and the covariates and the R2 of a null model with the covariates alone. Covariates in the model included the first 20 PCs and sex.
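The incremental pseudo R2 described above can be sketched from model log-likelihoods using the standard Nagelkerke formula. The case and control counts are those of the testing sample, but the two fitted log-likelihoods are hypothetical placeholders; in a real analysis they would come from logistic regressions with and without the PRS.

```python
import math


def nagelkerke_r2(ll_model: float, ll_intercept: float, n: int) -> float:
    """Nagelkerke's pseudo R2 from a model and an intercept-only log-likelihood."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_intercept - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_intercept / n)
    return cox_snell / max_cox_snell


def intercept_only_loglik(n_cases: int, n_controls: int) -> float:
    """Log-likelihood of a logistic model containing only an intercept."""
    n = n_cases + n_controls
    p = n_cases / n
    return n_cases * math.log(p) + n_controls * math.log(1.0 - p)


n_cases, n_controls = 360, 1200          # testing sample size from the text
n = n_cases + n_controls
ll0 = intercept_only_loglik(n_cases, n_controls)

ll_null = -830.0   # hypothetical: covariates only (20 PCs + sex)
ll_full = -790.0   # hypothetical: covariates + standardized PRS

r2_null = nagelkerke_r2(ll_null, ll0, n)
r2_full = nagelkerke_r2(ll_full, ll0, n)
incremental_r2 = r2_full - r2_null       # variance attributed to the PRS
```

Taking the difference between the full and null models removes the contribution of the covariates, so the reported value reflects the PRS alone.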

PRS evaluation

After optimizing the PRS model in held-out testing samples of 360 cases and 1,200 controls, we computed the PRS in the CCLS, which is our validation dataset. The CCLS included 306 cases and 258 controls in the NLW subcohort, and 592 cases and 509 controls in the LAT subcohort. We quantified the predictive performance of PRS by Nagelkerke’s pseudo R2 and area under the receiver operating characteristic curve (AUC; the probability that a case ranks higher than a control). We assessed the transferability of a PRS model between populations by testing for statistical differences in its performance measures between populations.33 AUC was computed for the full model with covariates to account for population stratification. AUC for the null model (ALL ∼20 PCs + sex) is 0.593 and 0.577 in CCLS LAT and NLW, respectively. AUCs were calculated using the pROC package in R.34 Standard errors and tests for statistical differences in these measures of model performance were computed with 1,000 sets of bootstrap samples across individuals and populations.
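The rank-based interpretation of AUC and a bootstrap standard error can be sketched as follows. This is an illustrative stand-in, not the pROC implementation used in the study. The scores are hypothetical, and resampling cases and controls separately is an assumption about how the bootstrap is stratified.

```python
import random
import statistics


def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def bootstrap_se(case_scores, control_scores, n_boot=1000, seed=0):
    """Standard error of the AUC from resampling individuals with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        cases = rng.choices(case_scores, k=len(case_scores))
        controls = rng.choices(control_scores, k=len(control_scores))
        estimates.append(auc(cases, controls))
    return statistics.stdev(estimates)


# Hypothetical predicted scores (illustrative only).
toy_cases = [0.9, 0.4, 1.5, 0.1]
toy_controls = [0.2, -0.3, 0.4, -1.0]

toy_auc = auc(toy_cases, toy_controls)
toy_se = bootstrap_se(toy_cases, toy_controls, n_boot=200)
```

The same resampling loop can be reused for pseudo R2, or for the difference in a metric between two models or two populations, by replacing the statistic computed in each bootstrap replicate.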

In the case for evaluating transferability of the best PRS model for NLW_NLW strategy to LAT, we additionally used CCRLP LAT as a validation sample, stratified by global European ancestries. Local ancestry inference was first performed on CCRLP LAT cases and controls using RFMix,35 using a reference panel consisting of 671 non-Finnish European individuals for European ancestry, 716 African individuals for African-ancestry, and 94 Admixed American individuals (7 Colombian, 12 Karitianan, 14 Mayan, 4 Mexican in Los Angeles, 37 Peruvian in Lima, Peru, 12 Pima, and 8 Surui) for Indigenous American (IA) ancestry from gnomAD v3.1 release,36 as previously identified to be enriched with indigenous ancestry.37 We then summed the local ancestry estimates across the genome to derive the global ancestry estimates. We stratified Latino individuals into three tertiles of global European ancestry, and in each group evaluated the predictive performance of the best PRS model for NLW_NLW strategy.
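The two steps above, deriving global ancestry from local ancestry calls and splitting individuals into tertiles, can be sketched as below. This does not reproduce RFMix output handling: the segment representation, the length weighting, the individual identifiers, and all values are assumptions made for illustration.

```python
def global_ancestry(segments):
    """Length-weighted ancestry proportions for one individual.

    segments: list of (segment_length, ancestry_label) pooled over both haplotypes.
    """
    total = sum(length for length, _ in segments)
    proportions = {}
    for length, label in segments:
        proportions[label] = proportions.get(label, 0.0) + length / total
    return proportions


def split_into_tertiles(value_by_id):
    """Sort individuals by value and split them into three near-equal groups."""
    ordered = sorted(value_by_id, key=value_by_id.get)
    n = len(ordered)
    cut1, cut2 = n // 3, (2 * n) // 3
    return {"low": ordered[:cut1], "medium": ordered[cut1:cut2], "high": ordered[cut2:]}


# Hypothetical local ancestry segments for one individual (illustrative only).
toy_segments = [(50.0, "EUR"), (30.0, "IA"), (20.0, "AFR")]
toy_props = global_ancestry(toy_segments)

# Hypothetical global European ancestry proportions for six individuals.
eur_by_id = {"id1": 0.10, "id2": 0.80, "id3": 0.45, "id4": 0.30, "id5": 0.65, "id6": 0.55}
strata = split_into_tertiles(eur_by_id)
```

The PRS would then be evaluated separately within each stratum, using the same metrics as in the main validation.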

Results

To develop PRS models, we used three non-overlapping datasets to (1) perform discovery GWAS to estimate variant effect sizes, (2) optimize parameters in held-out samples for the best predictive score, and (3) evaluate the predictive performance of the resulting scores in external validation cohort (Figure 1). We explored multiple strategies to develop PRS models. We labeled the different strategies using the convention of “POPGWAS_POPtesting”, where POPGWAS is the population in which the discovery GWAS was conducted, and POPtesting is the population in which the optimization for the best model was performed (material and methods).

Transferability of genomic PRS for ALL

We first evaluated a genomic PRS model derived from GWAS summary statistics of an NLW cohort and its transferability to the LAT cohort. We performed a GWAS in 2,306 cases and 59,072 controls in NLW (Table S1) after holding out individuals for testing and validating the PRS models. Our first design is termed NLW_NLW, for the discovery GWAS was performed in NLW, and the model was optimized also in held-out NLW samples (strategy 1, Figure 1). This is a typical scenario where GWAS and PRS model optimizations were both completed in European ancestry populations.

The best model with the highest Nagelkerke’s pseudo R2 in the NLW_NLW approach was based on LDPred2, a non-sparse model with ρ = 0.0032 and h2 = 0.22. This model consisted of approximately 1.08M SNPs across the genome and is significantly associated with case/control status in both CCLS NLW and LAT cohorts (p = 4.1e-9 and 3.9e-12 for NLW and LAT, respectively). The resulting PRS explained 8.6% ± 3.2% of the variance in the CCLS NLW cohort, after accounting for covariates, as measured by pseudo R2 (Table 1). The same PRS model explained 6.0% ± 2.0% of the variance in the CCLS LAT cohort (Table 1 and Figure 2), suggesting minimal loss in efficacy upon transfer after taking into account the standard errors of these estimates. The AUCs in NLW and LAT are also similar (0.667 ± 0.045 and 0.652 ± 0.032 in NLW and LAT, respectively) in the full prediction model, which includes the PRS as well as sex and 20 PCs (Table 1).

Table 1.

Performance of the best model for NLW_NLW strategy across different testing datasets

| Testing dataset | Sample size | p value | AUC | SE_AUC | PseudoR2 | SE_PseudoR2 |
| --- | --- | --- | --- | --- | --- | --- |
| CCLS NLW | 564 | 4.13E-09 | 0.667 | 0.045 | 0.086 | 0.032 |
| CCLS LAT | 1101 | 3.95E-12 | 0.652 | 0.032 | 0.060 | 0.020 |
| CCRLP LAT (high EUR) | 1300 | 3.59E-09 | 0.624 | 0.030 | 0.036 | 0.016 |
| CCRLP LAT (medium EUR) | 1301 | 2.68E-12 | 0.629 | 0.030 | 0.051 | 0.017 |
| CCRLP LAT (low EUR) | 1300 | 1.79E-09 | 0.617 | 0.030 | 0.037 | 0.016 |

p value denotes the evidence of association of the PRS in a logistic regression model with additional covariates of 20 PCs and sex. AUC denotes area under the curve from receiver operator characteristic analysis. PseudoR2 was calculated from the difference between a logistic regression model with PRS and one without PRS. SE denotes standard error for both AUC and PseudoR2, which were computed using 1,000 bootstrap samples. high, medium, and low EUR denote the top, middle, and bottom tertile, respectively, of CCRLP LAT individuals sorted by proportion of estimated European ancestries.

Figure 2.

PRS efficacy based on the best-performing model for each strategy, as validated in CCLS Latinos

The PRS efficacy, as measured by pseudo R2, of the best-performing model for each strategy aimed to improve the efficacy of PRS models for LAT is summarized and compared with a baseline model. Each strategy is labeled by the convention we used in this study, POPGWAS_POPtesting, where POPGWAS is the population in which discovery GWAS was conducted, and POPtesting is the population in which the optimization of the model was performed. Each model is also numbered (1, 2a–2d) according to the strategy design in Figure 1. In all cases the PRS models were validated in CCLS LAT. ∗Strategy 2b (LAT_LAT) is significantly better than the baseline model (p = 0.0019). ∗∗Strategies 2c and 2d (META_NLW and META_LAT, respectively) are both better than the baseline model as well (p < 1e-4). Standard errors and statistical significance are computed from 10,000 rounds of bootstrap samples across individuals. No other comparisons produced statistically significant results.

An alternative approach to evaluate the transferability of the genomic PRS model is to assess if the prediction efficacy differs by proportion of European ancestry in the Latino individuals. Because the CCRLP has the largest collection of LAT individuals and has not been used in the NLW_NLW model, we can evaluate the prediction accuracy in CCRLP LAT individuals (N = 3,901; 1,878 cases). In tertiles of LAT individuals, each with approximately 1,300 individuals, we found little evidence of differences in performance across strata of ancestry proportions (Pseudo R2 = 0.036, 0.051, 0.037 across the highest, middle, and lowest tertiles by European ancestries in LAT; AUC = 0.624, 0.629, and 0.617, respectively; Table 1). Taken together, we identified little evidence that there is a substantial difference in transferability between NLW and LAT populations or ancestries.

Improving the prediction accuracy of genomic PRS for Latinos

We first evaluated a scenario where the LAT was used as the cohort to identify the optimal PRS model, even though the discovery GWAS was still from NLW (NLW_LAT, strategy 2a in Figure 1). We found that in this case, the best model was a LDPred2 sparse model with parameters ρ = 0.01 and h2 = 0.1826. This model did not appear to improve the performance of the PRS in CCLS LAT over the best NLW_NLW model (Pseudo R2 = 0.041 ± 0.018, compared with 0.060 ± 0.020 under NLW_NLW approach; Figure 2 and Table S2).

We also evaluated a scenario where the LAT were used both for the discovery GWAS and PRS model optimization. In this case, 1,518 cases and 7,210 controls of LAT individuals from CCRLP+GERA were used in the discovery GWAS (LAT_LAT strategy; 2b in Figure 1). The best PRS model from this approach was a LDPred2 sparse model with parameters ρ = 0.001 and h2 = 0.1764. When validating this model in CCLS, the performance was significantly better than when the NLW had been used for discovery GWAS (Pseudo R2 = 0.116 ± 0.026, compared to 0.060 ± 0.020 under the NLW_NLW strategy, p = 0.0019; Figure 2 and Table S2).

Finally, as discovery GWAS based in NLW or LAT are both potentially underpowered, we also evaluated the multi-ancestry meta-analysis design that combined all four cohorts from CCRLP+GERA as well as the COG+WTCCC samples (Table S1). In total, the GWAS contained 4,266 cases (2,306, 1,518, 318, and 124 in NLW, LAT, EAS, and AFR, respectively) and 73,366 controls (59,072, 7,210, 5,017, and 2,067 in NLW, LAT, EAS, and AFR, respectively). We then trained the best genomic PRS model in either NLW (META_NLW; strategy 2c in Figure 1) or LAT (META_LAT; strategy 2d in Figure 1), both using a held-out sample of 360 cases and 1,200 controls. Likely due to the increased sample sizes, the meta-analysis designs produced the best-performing genomic PRS models. Under the META_NLW design, the best model was a LDPred2 sparse model with parameters ρ = 0.0032 and h2 = 0.1376. Under this model, the prediction accuracy in CCLS LAT was better than the naive NLW_NLW strategy (Pseudo R2 = 0.131 ± 0.025 vs. 0.060 ± 0.020, p < 1e-4; Figure 2 and Table S2), and slightly though not significantly higher than the LAT_LAT strategy (Pseudo R2 = 0.116 ± 0.026; p = 0.15). The best META_LAT model (a non-sparse model with parameters ρ = 0.001 and h2 = 0.1127) also appeared to perform similarly to the META_NLW approach (Pseudo R2 = 0.130 ± 0.024; Figure 2 and Table S2). The AUCs for the full model including PRS, sex, and PCs were 0.700 and 0.701 for META_NLW and META_LAT strategies, respectively. Our results thus suggest that given the currently available data, combining the largest multi-ethnic sample for discovery GWAS will lead to the best genomic PRS model in terms of prediction accuracy.

As the multi-ancestry meta-analysis GWAS is the most powerful discovery GWAS currently available, we also evaluated the transferability of the PRS model from the META_NLW strategy by comparing the PRS performance in CCLS NLW vs. LAT samples. The Pseudo R2 remains comparable between the two cohorts (e.g., under the META_NLW strategy, Pseudo R2 = 0.153 ± 0.034 for NLW vs. 0.131 ± 0.025 for LAT; Figure 3 and Table S3). This result is consistent with the attempt described above (NLW_NLW strategy; Table 1).

Figure 3.

Predictive performance of best-performing genomic PRS model (META_NLW) vs PRS model constructed with 23 known ALL risk SNPs

Both models were tested in CCLS NLW and LAT, which were not used in the discovery GWAS for genomic PRS. CCLS was also not used in the identification of the 23 known loci in literature, although it had been used as replication cohort. ∗In CCLS LAT, the pseudo R2 for PRS model based on 23 known loci is significantly better than that from our best-performing genomic PRS model (p < 1e-4) based on 10,000 sets of bootstrapping.

Genetic architecture of ALL

LDPred2 has two different modes of inference, LDPred2-grid and LDPred2-inf, where the former assumes that some proportion of the variants is causal and parameters need to be optimized in a grid, while the latter assumes an infinitesimal model where every variant has a mean effect of 0 with some small variance. In our META_NLW and META_LAT approaches, where we have the most powerful discovery GWAS to guide PRS model construction, we noticed that LDPred2-grid models consistently outperformed the LDPred2-inf models (e.g., Pseudo R2 = 0.130 ± 0.025 in LDPred2-grid vs. 0.013 ± 0.016 in LDPred2-inf when the model under the META_LAT strategy was evaluated in CCLS LAT; Figure S1 and Table S3). Our results are thus consistent with a more oligogenic architecture of ALL, while LDPred2-inf is more appropriate for traits with highly polygenic inheritance.

Genomic PRS vs. PRS based on genome-wide significant loci

Generally speaking, genomic PRS models, whether through P + T, LDPred2, or other similar approaches, are expected to be more accurate in risk prediction or stratification over a simple PRS model based solely on the set of known GWAS loci (i.e., those that have been shown to reach a p value less than 5e-8 in one or more GWAS for a particular trait).31 Indeed, in each of the strategies that we have examined, the best genomic PRS models tend to be better than P + T model with p value threshold of 5e-8, a special case that is equivalent to building a PRS model with the genome-wide significant loci. For instance, under the META_LAT strategy, the best PRS model achieved a pseudo R2 of 0.130 ± 0.024, while the best P + T model with p value threshold of 5e-8 only attained a pseudo R2 of 0.088 ± 0.021.

However, the genomic PRS requires a held-out sample to optimize the parameters for building the PRS. This necessitates a reduction in the sample sizes available for GWAS. While this may not be a huge obstacle for common diseases, it could be a concern for a rare disease such as ALL. In order to evaluate the genomic PRS, we had to reduce the number of cases in our discovery GWAS by about 16% overall (from 2,666 cases to 2,306 cases in NLW and from 1,878 cases to 1,518 cases in LAT, after removing 360 cases each from CCRLP/GERA as held-out testing samples). Thus, an alternative approach could have been to construct a simple PRS model based on only genome-wide significant variants, and subsequently test this PRS model in independent cohorts.
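The reduction in cases quoted above can be checked with a short calculation. It uses only the case counts given in the text; the variable names are illustrative.

```python
held_out_cases = 360  # cases removed per population for model optimization

# Case counts available before holding out the testing samples.
cases_before = {"NLW": 2666, "LAT": 1878}

# Cases remaining for the discovery GWAS in each population.
cases_after = {pop: n - held_out_cases for pop, n in cases_before.items()}

# Fraction of cases lost per population and overall.
reduction = {pop: held_out_cases / n for pop, n in cases_before.items()}
overall_reduction = (held_out_cases * len(cases_before)) / sum(cases_before.values())
```

The per-population losses (about 13.5% for NLW and 19.2% for LAT) match the held-out percentages given in the overall study design, and the combined loss rounds to 16%.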

We built a PRS model using 23 SNPs previously associated with ALL, identified across 11 studies.2,3,4,5,6,7,8,9,10,11,12 These 23 SNPs were derived from 19 loci, including conditionally independent secondary associations at four loci (Table S4). These associated SNPs were identified in one or more independent cohorts in the literature, including the full CCRLP/GERA datasets that were used for constructing and evaluating genomic PRS models above. Because there is no need to optimize the PRS model in held-out samples, we directly tested this “conventional” PRS in the independent CCLS cohort, which was not used in the discovery of any of these 23 loci (although it had been used as part of the replication cohort in previous studies). This strategy produced better prediction accuracy than the best-performing genomic PRS models in CCLS LAT (Pseudo R2 = 0.166 ± 0.025; AUC = 0.726 compared with Pseudo R2 = 0.131 ± 0.025 from genomic PRS derived using the META_NLW strategy, p < 1e-4; Table 2 and Figure 3), a difference that was not seen between the conventional PRS and genomic PRS tested in CCLS NLW (Pseudo R2 = 0.151 ± 0.034; AUC = 0.706 compared with Pseudo R2 = 0.153 ± 0.034 from genomic PRS derived using the META_NLW strategy; Table 2 and Figure 3).

Table 2.

Predictive performance of best-performing genomic PRS model vs. conventional model constructed with 23 known ALL risk SNPs

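| Model | CCLS NLW PseudoR2 | CCLS NLW SE | CCLS NLW AUC | CCLS LAT PseudoR2 | CCLS LAT SE | CCLS LAT AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Conventional PRS | 0.151 | 0.034 | 0.706 | 0.166 | 0.025 | 0.726 |
| Genomic PRS | 0.153 | 0.034 | 0.710 | 0.131 | 0.025 | 0.700 |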

Conventional PRS is a model based on 23 SNPs known in the literature to be associated with ALL, having passed the genome-wide significance threshold of 5e-8 in GWAS. PseudoR2 was calculated from the difference between a logistic regression model with PRS and one without PRS. SE denotes the standard error of PseudoR2, estimated from 1,000 bootstrap samples.

Discussion

In the current study, we leveraged the largest available multi-ancestry meta-analysis GWAS to investigate strategies to build and evaluate PRS models for ALL across populations. We evaluated the extent of loss in efficacy for PRS models trained solely in NLW populations but applied in LAT populations, explored approaches to improve PRS models for LAT through different optimization strategies, and compared the genomic PRS models against a simple model that used all previously reported genome-wide significant ALL-associated variants with no optimization. We found little evidence of a loss in efficacy when transferring the genomic PRS model between populations. We also found that while leveraging multi-ancestry information to increase GWAS sample sizes and representation could lead to much more effective genomic PRS models (pseudo R2 = 0.131 ± 0.025, AUC = 0.700), this model currently still has lower prediction accuracy for Latinos than a simple model using only 23 known ALL-associated SNPs (pseudo R2 = 0.166 ± 0.025, AUC = 0.726), which were derived from multiple independent cohorts in the literature (including cohorts that we could not access and therefore did not use in this study for building genomic PRS).

We undertook multiple analytical approaches to evaluate the transferability of a PRS model for ALL, but generally found little evidence of loss in efficacy across populations. After determining the best predictive PRS models in NLW, using either the CCRLP NLW (NLW_NLW approach) or the meta-analysis (META_NLW) for the discovery GWAS, we observed little difference in performance in CCLS NLW and LAT subjects (66.7% in NLW vs. 65.2% in LAT by AUC for the NLW_NLW strategy; 71.0% in NLW vs. 70.0% in LAT by AUC for the META_NLW strategy; Tables 1 and S3). We also did not observe differences in the PRS predictive performance across strata of LAT individuals by estimated European ancestry proportions (Table 1). If there had been overt transferability issues, we would expect that strata with the highest European ancestry would have higher prediction accuracy compared with strata with lower European ancestry. It remains unclear whether the similar performance across populations is driven by representation in GWAS, European ancestry admixture in LAT samples, or sufficiently shared genetic architecture between populations for ALL that is minimally impacted by LD differences.
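The stratified comparison described above can be illustrated with a short sketch. It computes the AUC as the probability that a randomly chosen case has a higher score than a randomly chosen control, then repeats the calculation within bins of estimated European ancestry proportion. The bin cutoffs and the toy data are hypothetical, and this stands in for, rather than reproduces, the ROC software used in the study.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def auc_by_stratum(scores, labels, ancestry, cutoffs):
    """AUC within ancestry bins [cutoffs[i], cutoffs[i+1]); the last bin includes its upper edge."""
    results = {}
    for lo, hi in zip(cutoffs[:-1], cutoffs[1:]):
        last = hi == cutoffs[-1]
        idx = [i for i, a in enumerate(ancestry)
               if lo <= a < hi or (last and a == hi)]
        results[(lo, hi)] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return results


# Toy data: PRS, case (1) / control (0) status, and European ancestry proportion.
scores = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6]
labels = [0, 1, 0, 1, 0, 1, 1, 0]
ancestry = [0.2, 0.3, 0.4, 0.1, 0.6, 0.7, 0.8, 0.9]
print(auc_by_stratum(scores, labels, ancestry, cutoffs=[0.0, 0.5, 1.0]))
```

If transferability were strongly ancestry dependent, the per-stratum AUCs would be expected to rise with European ancestry proportion; comparing them requires accounting for the sampling error in each stratum.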

One possible explanation for comparable PRS efficacy between LAT and NLW is shared genetic architecture. That is, ALL may follow a more oligogenic architecture where several large effect causal loci exist on top of a polygenic background of smaller effect causal loci. Our previous study2 demonstrated that the genetic correlation between NLW and LAT is relatively high (rG = 0.714 ± 0.13), though it could be different from 1 (p = 0.014). Here, we have shown that an LDpred2 model for PRS assuming infinitesimal causal loci drastically underperforms compared with one without this assumption (Figure S1 and Table S3). Combining these two observations, we speculate that the disease architecture for ALL may be driven by a few large effect loci that are shared across ancestries. The lower genetic correlation between populations may then be driven by significant differences in the polygenic background, or by other yet-undiscovered population-enriched alleles. But as these loci may have smaller effects, PRS efficacy, and hence transferability, could be driven largely by the main effect loci, at least within the resolution of the sample size of the current validation cohort (i.e., CCLS). Future studies that continue to elucidate the polygenic background of the ALL architecture, particularly if focused on a single ethnic group such as NLW, may then both improve the accuracy of PRS models and exacerbate a loss of efficacy across populations that we are not currently able to detect. For this reason, we advocate for greater inclusion in GWAS representation despite currently observing little evidence of a loss of transferability in PRS models.
With regard to the LAT population, which has a higher risk for ALL, increasing sample sizes will likely help improve PRS models in this population; indeed, using a smaller GWAS solely from LAT already substantially improved PRS prediction efficacy in the out-of-sample LAT cohort (Figure 1), and further discovery of LAT-enriched alleles will improve PRS models for this population. More generally, diverse ancestries in multi-ancestry GWAS can also help with better fine-mapping of causal loci, which would improve both efficacy and transferability of PRS models.15,17
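The infinitesimal assumption discussed above can be made concrete through the prior placed on per-variant effect sizes. The following is a sketch of the point-normal form described in the LDpred and LDpred2 literature, written with ρ for the proportion of causal variants; M (the number of variants) and h² (the SNP heritability) are symbols introduced here for illustration.

```latex
\beta_j \sim
\begin{cases}
\mathcal{N}\!\left(0,\ \dfrac{h^2}{M\rho}\right) & \text{with probability } \rho,\\[4pt]
0 & \text{with probability } 1-\rho,
\end{cases}
\qquad j = 1,\dots,M .
```

Setting ρ = 1 gives the infinitesimal model, in which every variant carries a small nonzero effect, whereas ρ < 1 concentrates the same heritability in a smaller set of variants.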

A number of limitations in our study exist, which in turn provide guidance for future designs to propose and evaluate PRS models for ALL. First, efforts to continue to increase the sample size in GWAS are imperative. We have shown that genomic PRS using LDpred2 outperforms that based on just genome-wide significant loci (using the same GWAS, i.e., P + T models with p value threshold of 5e-8). But this model currently does not outperform one simply based on aggregating all known loci from the literature, effectively combining information across multiple independent GWAS datasets. Therefore, an aggregation of available GWAS through a consortium effort should provide the ideal dataset to train better genomic PRS models. Efforts like the Childhood Leukemia International Consortium (CLIC)38 should provide the best resources in the foreseeable future. CLIC meta- or mega-analysis will include around 20,000 cases and 160,000 multi-ancestry ALL cases and controls. Given a larger dataset, we can ensure greater sample sizes for both the testing and validation datasets to iteratively assess the transferability of the PRS models. This is particularly important for Latino populations, given the known heterogeneity in ancestry compositions and fine-scale structure of Latinos across the United States and Latin America.39,40 Second, this consortium effort will also begin to incorporate ancestrally diverse populations that we could not properly evaluate in this study, namely the African-ancestry and East Asian-ancestry populations. Preliminarily, the consortium will contain ∼3,400 African Americans (∼700 cases) and ∼11,000 East Asians (∼1,300 cases), which could allow investigations of PRS similar to those presented here to be conducted in these populations. Future aggregations of other ancestrally diverse cohorts will also be needed.

Third, given the suspected oligogenic architecture of ALL, alternative PRS strategies that incorporate information from the distribution of effect sizes may also further improve performance from a methodological standpoint. While LDpred2 partly controls the proportion of the genome underlying a trait through optimization of the ρ parameter, its prior is ultimately a “spike-and-slab” prior. A more direct modeling of the distribution of effect sizes, on top of a polygenic background, may prove to be a better model for ALL. Methods following these types of models are emerging (e.g., see Spence et al.41) and will likely become more mature in the near future. But even without a unified framework to model effect size distributions, a simpler approach42 that combines weighted PRS could also be more effective. In this case, one score would be derived from sections of the genome known to be associated with ALL that may also include multiple secondary but independent causal variants, and the other score could be derived from LDpred2 or similar approaches from the rest of the genome. The weights between these two scores can then be optimized in the training dataset as an additional parameter to derive a score that may outperform any of the existing models evaluated here.
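The two-score combination proposed above can be sketched as a one-parameter search. In this hypothetical example, both scores are standardized, combined as w × (known-loci score) + (1 − w) × (genome-wide score), and w is chosen on a grid to maximize AUC in the training data. The use of AUC as the criterion, the grid search, and all names and toy values are assumptions made for illustration; a regression-based choice of mixing weights would be an equally reasonable alternative.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def combine(known, genomic, w):
    """Weighted sum of two scores with mixing weight w in [0, 1]."""
    return [w * a + (1.0 - w) * b for a, b in zip(known, genomic)]


def best_weight(known, genomic, labels, grid_steps=20):
    """Grid-search the mixing weight that maximizes training AUC.

    Returns (weight, auc). The weight should be re-evaluated in an
    independent validation sample to avoid an optimistic estimate.
    """
    known_z, genomic_z = standardize(known), standardize(genomic)
    best_w, best_auc = 0.0, -1.0
    for step in range(grid_steps + 1):
        w = step / grid_steps
        current = auc(combine(known_z, genomic_z, w), labels)
        if current > best_auc:
            best_w, best_auc = w, current
    return best_w, best_auc


# Toy training data: the known-loci score separates cases from controls,
# while the genome-wide score is uninformative here.
labels = [0, 0, 1, 1]
known_loci_score = [0.1, 0.2, 0.8, 0.9]
genome_wide_score = [0.5, 0.1, 0.4, 0.2]
print(best_weight(known_loci_score, genome_wide_score, labels))
```

Standardizing first keeps the weight interpretable, since otherwise the score with the larger variance would dominate regardless of w.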

PRSs are intended to be robust prediction tools that would be utilized in research and clinical settings. In research settings, PRSs would be applicable in defining the attributable fraction of leukemia risk derived from common genetic variation when examining other risks, whether from low-frequency genetic alleles or environmental factors. In addition, inclusion of PRSs in environmental epidemiological studies of ALL to account for the contribution of germline variation may improve the power and specificity of environment-ALL risk models, as well as increase the power to detect gene-by-environment interactions.43 Ultimately, PRSs may be incorporated with additional risk prediction tools, such as markers of early leukemia-promoting mutations, on a population scale in neonatal screening efforts where interventions are available. Many barriers exist for the deployment of PRSs in clinical settings. While it may be premature to anticipate the clinical applications, any deployment of PRS will require accurate tools across all ancestral/ethnic groups, particularly for the Latino population, which harbors the greatest risk of ALL. Our study represents one of the first approaches toward this goal.

Data and code availability

The analysis pipeline used for all analyses in this manuscript is documented on GitHub at http://www.github.com/syjeoneli/grps_v2.0. CCRLP and CCLS genetic data used in this manuscript are derived from the California Biobank. We respectfully are unable to share raw, individual genetic data freely with other investigators since the samples and the data are the property of the State of California. Should we be contacted by other investigators who would like to use the data, we will direct them to the California Department of Public Health Institutional Review Board to establish their own approved protocol to utilize the data, which can then be shared peer-to-peer. The State has provided guidance on data sharing noted in the following statement: "California has determined that researchers requesting the use of California Biobank biospecimens for their studies will need to seek an exemption from NIH or other granting or funder requirements regarding the uploading of study results into an external bank or repository (including into the NIH dbGaP or other bank or repository). This applies to any uploading of genomic data and/or sharing of these biospecimens or individual data derived from these biospecimens. Such activities have been determined to violate the statutory scheme at California Health and Safety Code Section 124980 (j), 124991 (b), (g), (h) and 103850 (a) and (d), which protect the confidential nature of biospecimens and individual data derived from biospecimens. Investigators may agree to share aggregate data on SNP frequency and their associated p values with other investigators and may upload such frequencies into repositories including the NIH dbGaP repository providing: a) the denominator from which the data is derived includes no fewer than 20,000 individuals; b) no cell count is for <5 individuals; and c) no correlations or linkage probabilities between SNPs are provided."
Since our dataset is derived from fewer than 20,000 subjects, we are not able to upload the data to dbGaP or another repository. GERA and COG datasets are not derived from the California Biobank and are available on dbGaP. The accession numbers for COG and GERA are phs000638.v1.p1 and phs000788.v1.p2, respectively.

Acknowledgments

This work was supported by research grants from the National Institutes of Health (R01CA155461, R01CA175737, R01ES009137, P42ES004705, P01ES018172, P42ES0470518, R24ES028524, and R01CA262263) and the Environmental Protection Agency (RD83451101), United States. C.W.K.C. and S.J. were supported by R35GM142783 from the National Institute of General Medical Sciences (NIGMS). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and the EPA. The collection of cancer incidence data used in this study was supported by the California Department of Public Health as part of the statewide cancer reporting program mandated by California Health and Safety Code Section 103885; the National Cancer Institute’s Surveillance, Epidemiology and End Results Program under contract HHSN261201000140C awarded to the Cancer Prevention Institute of California, contract HHSN261201000035C awarded to the University of Southern California, and contract HHSN261201000034C awarded to the Public Health Institute; and the Centers for Disease Control and Prevention’s National Program of Cancer Registries, under agreement U58DP003862-01 awarded to the California Department of Public Health. The biospecimens and/or data used in this study were obtained from the California Biobank Program (SIS request #26 and #1380). The California Department of Public Health is not responsible for the results or conclusions drawn by the authors of this publication. This study makes use of data generated by the Wellcome Trust Case–Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113 and 085475. 
For recruitment of subjects enrolled in the CCLS replication set, the authors gratefully acknowledge the clinical investigators at the following collaborating hospitals: University of California Davis Medical Center (Dr. Jonathan Ducore), University of California, San Francisco (Drs. Mignon Loh and Katherine Matthay), Children’s Hospital of Central California (Dr. Vonda Crouse), Lucile Packard Children’s Hospital (Dr. Gary Dahl), Children’s Hospital Oakland (Dr. James Feusner), Kaiser Permanente Roseville (formerly Sacramento) (Drs. Kent Jolly and Vincent Kiley), Kaiser Permanente Santa Clara (Drs. Carolyn Russo, Alan Wong, and Denah Taggart), Kaiser Permanente San Francisco (Dr. Kenneth Leung), and Kaiser Permanente Oakland (Drs. Daniel Kronish and Stacy Month). The authors additionally thank the families for their participation in the California Childhood Leukemia Study (formerly known as the Northern California Childhood Leukemia Study). Finally, the authors acknowledge the Center for Advanced Research Computing (CARC; https://carc.usc.edu) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication.

Declaration of interests

The authors declare no competing interests.

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.xhgg.2023.100239.

Web resources

PRS analysis pipeline: http://www.github.com/syjeoneli/grps_v2.0.

Wellcome Trust Case-Control Consortium: http://www.wtccc.org.uk.

Supplemental information

Document S1. Figures S1 and Tables S1–S4
Document S2. Article plus supplemental information

References

  • 1. Siegel D.A., Henley S.J., Li J., Pollack L.A., Van Dyne E.A., White A. Rates and Trends of Pediatric Acute Lymphoblastic Leukemia — United States, 2001–2014. MMWR Morb. Mortal. Wkly. Rep. 2017;66:950–954. doi: 10.15585/mmwr.mm6636a3.
  • 2. Jeon S., de Smith A.J., Li S., Chen M., Chan T.F., Muskens I.S., Morimoto L.M., DeWan A.T., Mancuso N., Metayer C., et al. Genome-wide trans-ethnic meta-analysis identifies novel susceptibility loci for childhood acute lymphoblastic leukemia. Leukemia. 2022;36:865–868. doi: 10.1038/s41375-021-01465-1.
  • 3. Vijayakrishnan J., Kumar R., Henrion M.Y.R., Moorman A.V., Rachakonda P.S., Hosen I., da Silva Filho M.I., Holroyd A., Dobbins S.E., Koehler R., et al. A genome-wide association study identifies risk loci for childhood acute lymphoblastic leukemia at 10q26.13 and 12q23.1. Leukemia. 2017;31:573–579. doi: 10.1038/leu.2016.271.
  • 4. Wiemels J.L., Walsh K.M., de Smith A.J., Metayer C., Gonseth S., Hansen H.M., Francis S.S., Ojha J., Smirnov I., Barcellos L., et al. GWAS in childhood acute lymphoblastic leukemia reveals novel genetic associations at chromosomes 17q12 and 8q24.21. Nat. Commun. 2018;9:286. doi: 10.1038/s41467-017-02596-9.
  • 5. Papaemmanuil E., Hosking F.J., Vijayakrishnan J., Price A., Olver B., Sheridan E., Kinsey S.E., Lightfoot T., Roman E., Irving J.A.E., et al. Loci on 7p12.2, 10q21.2 and 14q11.2 are associated with risk of childhood acute lymphoblastic leukemia. Nat. Genet. 2009;41:1006–1010. doi: 10.1038/ng.430.
  • 6. Perez-Andreu V., Roberts K.G., Harvey R.C., Yang W., Cheng C., Pei D., Xu H., Gastier-Foster J., E S., Lim J.Y.-S., et al. Inherited GATA3 variants are associated with Ph-like childhood acute lymphoblastic leukemia and risk of relapse. Nat. Genet. 2013;45:1494–1498. doi: 10.1038/ng.2803.
  • 7. Treviño L.R., Yang W., French D., Hunger S.P., Carroll W.L., Devidas M., Willman C., Neale G., Downing J., Raimondi S.C., et al. Germline genomic variants associated with childhood acute lymphoblastic leukemia. Nat. Genet. 2009;41:1001–1005. doi: 10.1038/ng.432.
  • 8. Xu H., Yang W., Perez-Andreu V., Devidas M., Fan Y., Cheng C., Pei D., Scheet P., Burchard E.G., Eng C., et al. Novel susceptibility variants at 10p12.31-12.2 for childhood acute lymphoblastic leukemia in ethnically diverse populations. J. Natl. Cancer Inst. 2013;105:733–742. doi: 10.1093/jnci/djt042.
  • 9. Vijayakrishnan J., Qian M., Studd J.B., Yang W., Kinnersley B., Law P.J., Broderick P., Raetz E.A., Allan J., Pui C.-H., et al. Identification of four novel associations for B-cell acute lymphoblastic leukaemia risk. Nat. Commun. 2019;10:5348. doi: 10.1038/s41467-019-13069-6.
  • 10. de Smith A.J., Walsh K.M., Morimoto L.M., Francis S.S., Hansen H.M., Jeon S., Gonseth S., Chen M., Sun H., Luna-Fineman S., et al. Heritable variation at the chromosome 21 gene ERG is associated with acute lymphoblastic leukemia risk in children with and without Down syndrome. Leukemia. 2019;33:2746–2751. doi: 10.1038/s41375-019-0514-9.
  • 11. Vijayakrishnan J., Henrion M., Moorman A.V., Fiege B., Kumar R., da Silva Filho M.I., Holroyd A., Koehler R., Thomsen H., Irving J.A., et al. The 9p21.3 risk of childhood acute lymphoblastic leukaemia is explained by a rare high-impact variant in CDKN2A. Sci. Rep. 2015;5:15065. doi: 10.1038/srep15065.
  • 12. de Smith A.J., Walsh K.M., Francis S.S., Zhang C., Hansen H.M., Smirnov I., Morimoto L., Whitehead T.P., Kang A., Shao X., et al. BMI1 enhancer polymorphism underlies chromosome 10p12.31 association with childhood acute lymphoblastic leukemia. Int. J. Cancer. 2018;143:2647–2658. doi: 10.1002/ijc.31622.
  • 13. International Schizophrenia Consortium. Purcell S.M., Wray N.R., Stone J.L., O’Donovan M.C., Sullivan P.F., Sklar P., et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752. doi: 10.1038/nature08185.
  • 14. Santoro M.L., Ota V., de Jong S., Noto C., Spindola L.M., Talarico F., Gouvea E., Lee S.H., Moretti P., Curtis C., et al. Polygenic risk score analyses of symptoms and treatment response in an antipsychotic-naive first episode of psychosis cohort. Transl. Psychiatry. 2018;8:174–178. doi: 10.1038/s41398-018-0230-7.
  • 15. Martin A.R., Kanai M., Kamatani Y., Okada Y., Neale B.M., Daly M.J. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x.
  • 16. Popejoy A.B., Fullerton S.M. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a.
  • 17. Fatumo S., Chikowore T., Choudhury A., Ayub M., Martin A.R., Kuchenbaecker K. A roadmap to increase diversity in genomic studies. Nat. Med. 2022;28:243–250. doi: 10.1038/s41591-021-01672-4.
  • 18. Martin A.R., Gignoux C.R., Walters R.K., Wojcik G.L., Neale B.M., Gravel S., Daly M.J., Bustamante C.D., Kenny E.E. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Am. J. Hum. Genet. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004.
  • 19. Barrington-Trimis J.L., Cockburn M., Metayer C., Gauderman W.J., Wiemels J., McKean-Cowdin R. Trends in childhood leukemia incidence over two decades from 1992 to 2013. Int. J. Cancer. 2017;140:1000–1008. doi: 10.1002/ijc.30487.
  • 20. Barrington-Trimis J.L., Cockburn M., Metayer C., Gauderman W.J., Wiemels J., McKean-Cowdin R. Rising rates of acute lymphoblastic leukemia in Hispanic children: trends in incidence from 1992 to 2011. Blood. 2015;125:3033–3034. doi: 10.1182/blood-2015-03-634006.
  • 21. Bhatia S., Sather H.N., Heerema N.A., Trigg M.E., Gaynon P.S., Robison L.L. Racial and ethnic differences in survival of children with acute lymphoblastic leukemia. Blood. 2002;100:1957–1964. doi: 10.1182/blood-2002-02-0395.
  • 22. Kadan-Lottick N.S., Ness K.K., Bhatia S., Gurney J.G. Survival Variability by Race and Ethnicity in Childhood Acute Lymphoblastic Leukemia. JAMA. 2003;290:2008–2014. doi: 10.1001/jama.290.15.2008.
  • 23. Linabery A.M., Ross J.A. Trends in childhood cancer incidence in the U.S. (1992–2004). Cancer. 2008;112:416–432. doi: 10.1002/cncr.23169.
  • 24. Vijayakrishnan J., Studd J., Broderick P., Kinnersley B., Holroyd A., Law P.J., Kumar R., Allan J.M., Harrison C.J., Moorman A.V., et al. Genome-wide association study identifies susceptibility loci for B-cell childhood acute lymphoblastic leukemia. Nat. Commun. 2018;9:1340. doi: 10.1038/s41467-018-03178-z.
  • 25. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
  • 26. Metayer C., Zhang L., Wiemels J.L., Bartley K., Schiffman J., Ma X., Aldrich M.C., Chang J.S., Selvin S., Fu C.H., et al. Tobacco smoke exposure and the risk of childhood acute lymphoblastic and myeloid leukemias by cytogenetic subtype. Cancer Epidemiol. Biomarkers Prev. 2013;22:1600–1611. doi: 10.1158/1055-9965.EPI-13-0350.
  • 27. Choi S.W., Mak T.S.-H., O’Reilly P.F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 2020;15:2759–2772. doi: 10.1038/s41596-020-0353-1.
  • 28. Khan A., Turchin M.C., Patki A., Srinivasasainagendra V., Shang N., Nadukuru R., Jones A.C., Malolepsza E., Dikilitas O., Kullo I.J., et al. Genome-wide polygenic score to predict chronic kidney disease across ancestries. Nat. Med. 2022;28:1412–1420. doi: 10.1038/s41591-022-01869-1.
  • 29. Willer C.J., Li Y., Abecasis G.R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26:2190–2191. doi: 10.1093/bioinformatics/btq340.
  • 30. Banda Y., Kvale M.N., Hoffmann T.J., Hesselson S.E., Ranatunga D., Tang H., Sabatti C., Croen L.A., Dispensa B.P., Henderson M., et al. Characterizing Race/Ethnicity and Genetic Ancestry for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort. Genetics. 2015;200:1285–1295. doi: 10.1534/genetics.115.178616.
  • 31. Privé F., Arbel J., Vilhjálmsson B.J. LDpred2: better, faster, stronger. Bioinformatics. 2020;36:5424–5431. doi: 10.1093/bioinformatics/btaa1029.
  • 32. Vilhjálmsson B.J., Yang J., Finucane H.K., Gusev A., Lindström S., Ripke S., Genovese G., Loh P.-R., Bhatia G., Do R., et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am. J. Hum. Genet. 2015;97:576–592. doi: 10.1016/j.ajhg.2015.09.001.
  • 33. Kim M.S., Naidoo D., Hazra U., Quiver M.H., Chen W.C., Simonti C.N., Kachambwa P., Harlemon M., Agalliu I., Baichoo S., et al. Testing the generalizability of ancestry-specific polygenic risk scores to predict prostate cancer in sub-Saharan Africa. Genome Biol. 2022;23:194. doi: 10.1186/s13059-022-02766-z.
  • 34. Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.-C., Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf. 2011;12:77. doi: 10.1186/1471-2105-12-77.
  • 35. Maples B.K., Gravel S., Kenny E.E., Bustamante C.D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 2013;93:278–288. doi: 10.1016/j.ajhg.2013.06.020.
  • 36. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
  • 37. Li S., Chiang C.W.K., Myint S.S., Arroyo K., Chan T.F., Morimoto L., Metayer C., de Smith A.J., Walsh K.M., Wiemels J.L. Localized variation in ancestral admixture identifies pilocytic astrocytoma risk loci among Latino children. PLoS Genet. 2022;18:e1010388. doi: 10.1371/journal.pgen.1010388.
  • 38. Metayer C., Milne E., Clavel J., Infante-Rivard C., Petridou E., Taylor M., Schüz J., Spector L.G., Dockerty J.D., Magnani C., et al. The Childhood Leukemia International Consortium. Cancer Epidemiol. 2013;37:336–347. doi: 10.1016/j.canep.2012.12.011.
  • 39. Bryc K., Durand E.Y., Macpherson J.M., Reich D., Mountain J.L. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am. J. Hum. Genet. 2015;96:37–53. doi: 10.1016/j.ajhg.2014.11.010.
  • 40. Gravel S., Zakharia F., Moreno-Estrada A., Byrnes J.K., Muzzio M., Rodriguez-Flores J.L., Kenny E.E., Gignoux C.R., Maples B.K., Guiblet W., et al. Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data. PLoS Genet. 2013;9:e1004023. doi: 10.1371/journal.pgen.1004023.
  • 41. Spence J.P., Sinnott-Armstrong N., Assimes T.L., Pritchard J.K. A flexible modeling and inference framework for estimating variant effect sizes from GWAS summary statistics. Preprint at bioRxiv. 2022. doi: 10.1101/2022.04.18.488696.
  • 42. Márquez-Luna C., Loh P.-R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium, Price A.L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 2017;41:811–823. doi: 10.1002/gepi.22083.
  • 43. Zhong C., Li S., Arroyo K., Morimoto L., de Smith A.J., Metayer C., Ma X., Kogan S.C., Gauderman J.W. Gene-environment analyses reveal novel genetic candidates with prenatal tobacco exposure in relation to risk for childhood acute lymphoblastic leukemia. Cancer Epidemiol. Biomarkers Prev. In press.


Articles from Human Genetics and Genomics Advances are provided here courtesy of Elsevier
