Abstract
Both acquired mutations and germline genetic variation are known risk factors for chronic lymphocytic leukemia (CLL). Joint characterization of germline, acquired, and clinical risk has the potential to improve CLL risk prediction. Here, we investigated whether inclusion of a CLL-associated polygenic score (PGS) and two common types of clonal hematopoiesis (CH), autosomal mosaic chromosomal alterations (mCAs) and CH of indeterminate potential (CHIP), could improve CLL risk stratification in 436,784 participants in the UK Biobank and a replication set of 35,382 participants in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Individual mCAs on chromosomes 11, 12, 13, 14, and 22, as well as CHIP mutations in known lymphoid driver genes, were strongly associated with CLL risk. Integrative models that included sex, age, smoking status, blood cell traits, genetic similarity, CLL PGS, autosomal mCAs, and CHIP had the greatest discriminative ability, with predictive utility waning five years after measurement of CH. Sensitivity analyses removing individuals with abnormal blood cell counts and CH commonly observed in CLL showed persistent increased discriminative ability. Evaluating cumulative absolute risk, the CLL PGS and CH had improved ability to stratify CLL cases into higher risk categories and controls into lower risk categories. Overall, this analysis details the enhanced ability to identify individuals at high risk of CLL when integrating germline and somatic data derived from peripheral blood.
Statement of Significance
Joint consideration of well-characterized clinical characteristics with germline genetic variation and somatic mutations can enable chronic lymphocytic leukemia risk stratification to support clinical decision-making and early detection.
Introduction
Chronic lymphocytic leukemia (CLL) is the most common lymphoid leukemia, accounting for one-quarter of leukemias.(1) Risk factors for CLL include age and male sex, and incidence is higher in European and North American populations.(1,2) Genetics are also strongly implicated with a reported eightfold increased risk of CLL observed in relatives of CLL patients.(3) Multiple genome-wide association studies (GWAS) have identified common inherited genetic risk loci,(4-11) with the most recent meta-analysis reporting 45 susceptibility variants accounting for 34% of the estimated heritability of CLL due to common variation.(10)
Some forms of clonal hematopoiesis (CH) have emerged as risk factors for CLL, although not all studies attempted to remove individuals with existing monoclonal B-cell lymphocytosis (MBL) or CLL clones. CH can be driven by both large structural chromosomal aberrations (e.g., chromosomal gains, losses, and copy neutral loss of heterozygosity (CNLOH)) known as mosaic chromosomal alterations (mCAs) and somatic point mutations known as CH of indeterminate potential (CHIP).(12) Multiple recurrent copy number changes have been identified in 80% of CLL cases with at least 1 of 4 common chromosomal alterations: deletions on 13q14 (50%), 11q (10-25%), or 17p (5-8%), or a gain of chromosome 12 (10-20%).(10,12-17) Whole exome sequencing studies (WES) have identified recurrent CLL driver mutations in NOTCH1, MYD88, TP53, ATM, SF3B1, FBXW7, POT1, CHD2, RPS15, IKZF3, ZNF292, ZMYM3, ARID1A, and PTPN11.(14,16,18-23)
In this study, we leveraged data from two large prospective studies, the UK Biobank (UKBB) and the Prostate, Lung, Colorectal and Ovarian Screening Trial (PLCO), to investigate the predictive value of integrating germline and somatic information with commonly measured clinical characteristics to improve CLL risk stratification of healthy individuals in the general population. We examined which CH events are most strongly associated with CLL, evaluated potential mediation of a CLL-associated polygenic score (PGS) by blood cell traits and CH types, and estimated the absolute risk of CLL among study participants based on increasingly integrative predictive models. Our findings show an improved risk stratification for CLL and underscore the strong contribution of both germline and somatic components to the genetic etiology of CLL.
Materials and Methods
Study populations
Available data from the UKBB (RRID:SCR_012815) and the PLCO were used to investigate the association between CH and incident CLL. Briefly, the UKBB is a large prospective study based in the United Kingdom that collected blood samples for genotyping, medical history, and environmental exposures from study participants between 2006 and 2010.(24) A total of 436,784 individuals with available genotyping data from blood-derived DNA were included in this study. Patients with self-reported or diagnosed cancer prior to enrollment in UKBB were excluded (Supplemental Figure 1). Cancer information was extracted from inpatient records and available cancer registry data. Prior cancers were defined as occurring before study enrollment using codes: C00-C14: Malignant neoplasms of lip, oral cavity and pharynx, C15-C26: Malignant neoplasms of digestive organs, C30-C39: Malignant neoplasms of respiratory and intrathoracic organs, C40-C41: Malignant neoplasms of bone and articular cartilage, C43-C44: Melanoma and other malignant neoplasms of skin, C45-C49: Malignant neoplasms of mesothelial and soft tissue, C50-C50: Malignant neoplasm of breast, C51-C58: Malignant neoplasms of female genital organs, C60-C63: Malignant neoplasms of male genital organs, C64-C68: Malignant neoplasms of urinary tract, C69-C72: Malignant neoplasms of eye, brain and other parts of central nervous system, C73-C75: Malignant neoplasms of thyroid and other endocrine glands, C76-C80: Malignant neoplasms of ill-defined, secondary and unspecified sites, C81-C96: Malignant neoplasms, stated or presumed to be primary, of lymphoid, hematopoietic and related tissue, C97-C97: Malignant neoplasms of independent (primary) multiple sites, D37-D48: Neoplasms of uncertain or unknown behavior. We further removed individuals who self-reported as being diagnosed with cancer (UKBB data field ‘20001’) before enrollment or those who withdrew from the study.
PLCO is a prospective cohort based on long-term follow-up of a randomized multi-center trial designed to understand the effects of screening on cancer-related mortality and secondary endpoints.(25) A total of 35,382 individuals with available genotyping information from blood-derived DNA, with follow-up time, and without prevalent cancer were included (Supplemental Figure 2). Our analysis was restricted to those with genetic similarity to European reference samples (see Study Variables section below) due to both small numbers of other populations and the batch imputation process implemented in PLCO as detailed elsewhere.(26)
Study variables
Incident CLL diagnosis was defined as any diagnosis occurring after study enrollment using ICD-10 code ‘C91.1’ in both UKBB and PLCO. Additional covariate information was obtained from questionnaire results at the time of enrollment in each study, including age, sex, and smoking status (never, former, current smoker). Genetic similarity to a reference population was also inferred for each participant using GrafPop.(27) Briefly, GrafPop calculates genetic distances from each subject to five reference samples from dbGaP, defined using study-reported population labels (1: White, European, European American; 2: Black, African, African American, Ghana, Yoruba; 3: Asian, East Asian, Chinese, Japanese; 4: Asian Indian, Pakistani; 5: Mexican, Latino) and estimates subject similarity to these reference samples based on these distances.(27) All covariate definitions were consistent across UKBB and PLCO. Blood cell trait assays were performed on blood samples from UKBB participants. We used white blood cell count (WBC) and monocyte and neutrophil percentages because they were available in UKBB for use as covariates and in sensitivity analyses. Not all available UKBB blood count parameters were added to the final models because of high correlations among them.(28) Blood cell trait data were not available for PLCO participants.
mCAs were detected in both cohorts using intensity values and haplotype information from SNP genotyping data obtained by hybridizing blood-derived DNA to SNP microarrays (UKBB: Affymetrix UK BiLEVE Axiom and UK Biobank Axiom arrays; PLCO: Illumina Global Screening Array). mCAs were detected using MoChA v2022-1-12 (https://github.com/freeseek/mochawdl) applying the developer’s recommended filters: a call rate ≥ 97% and a phased B allele frequency (BAF) autocorrelation across consecutive phased heterozygous sites in autosomes ≤ 3%. The methods and algorithm for MoChA have been described in detail elsewhere.(29,30) Briefly, MoChA converts genotyping intensity values into log2 R ratio (LRR) and BAF and employs a phase-based detection algorithm to capture allelic imbalances across segments of the genome. By incorporating long-range phase information, the method enables highly sensitive detection of mCAs, including copy number losses, gains, and loss of heterozygosity.
CHIP somatic mutations were detected in UKBB by utilizing whole exome sequencing (WES) data to perform variant calling using Mutect2 v4.2.1.0 and VarDictJava v1.6.0. Variants supported by both variant callers were retained, and variants were required to have a variant allele frequency (VAF) ≥2% with at least three supporting reads, including one from each of the forward and reverse strands. We created an internal panel of normals (PoN) using 50 individuals under 41 years of age from the UKBB without any established CH hotspot mutations. Variants that were found with a VAF greater than 2% in two or more samples from the PoN were removed. For every retained variant, we utilized the PoN to empirically estimate the sequencing noise at that position by comparing the alternate and reference allele counts within the individual sample against the alternate and reference counts within the pooled PoN samples using Fisher’s exact test. We adjusted for multiple hypothesis testing using a Bonferroni correction based on the size of the WES capture region, including splice site acceptor and donor sites, resulting in a p-value threshold of 1.3×10−9. Variants were also removed if they were 1) recurrent in greater than 1% of UKBB samples yet had not been previously reported in large-scale CH studies,(31,32) 2) reported in the gnomAD database [exome v2.1.1 and genome v3.1.2] with a population frequency above 5×10−4 or a maximum sub-population allele frequency above 5×10−3, 3) present at a VAF above 35% unless at a clear CH hotspot, or 4) present at a VAF of >25%, recurrent in the UKBB with a median VAF of >35%, and not previously found in large CH studies.(31,32) Loss of function (LoF) variants in DNMT3A, TET2, ASXL1 and PPM1D were not subject to exclusion criteria 3 and 4 as LoF variants in these genes are not commonly found in the germline.
After calling and filtering, functional variant annotation and putative driver classification were performed to provide comprehensive CHIP variant detection in UKBB. As WES data are not available for PLCO, no CHIP calls were available for analysis.
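The panel-of-normals noise test described above can be illustrated with a short sketch. It compares alternate/reference read counts in one sample against pooled PoN counts using a one-sided Fisher's exact test and applies the 1.3×10−9 threshold quoted in the text. The function names, the example read counts, and the choice of a one-sided alternative are assumptions made for illustration; the study does not describe its implementation at this level of detail.

```python
from math import comb

P_THRESHOLD = 1.3e-9  # Bonferroni-corrected threshold quoted in the text


def fisher_one_sided(sample_alt, sample_ref, pon_alt, pon_ref):
    """One-sided Fisher's exact p-value: probability of seeing at least
    `sample_alt` alternate reads in the sample under the hypergeometric null
    that the sample and the pooled PoN share the same alternate-read rate."""
    n_sample = sample_alt + sample_ref
    total_alt = sample_alt + pon_alt
    total = n_sample + pon_alt + pon_ref
    upper = min(n_sample, total_alt)
    numerator = sum(
        comb(total_alt, k) * comb(total - total_alt, n_sample - k)
        for k in range(sample_alt, upper + 1)
    )
    return numerator / comb(total, n_sample)


def exceeds_noise(sample_alt, sample_ref, pon_alt, pon_ref):
    """True when the variant's read support is unlikely to be sequencing noise
    (assumed decision rule: p-value below the Bonferroni threshold)."""
    return fisher_one_sided(sample_alt, sample_ref, pon_alt, pon_ref) < P_THRESHOLD


# Hypothetical read counts, for illustration only.
print(exceeds_noise(30, 70, 2, 5000))   # strong support relative to the PoN
print(exceeds_noise(1, 99, 50, 4950))   # indistinguishable from PoN noise
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; a production pipeline would more likely rely on an established statistics library.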
CHIP genes were classified using a multi-step integrative approach incorporating publicly available databases and published resources. Genes mutated in at least 5% of lymphoid or myeloid malignancies, based on mutation frequency data from myeloid and lymphoid cancer studies available on cBioPortal (https://www.cbioportal.org/), were classified as myeloid or lymphoid. Genes mutated in less than 5% of cases were annotated using IntOGen cancer driver data (https://www.intogen.org/) based on cancer type, with lymphoid-associated cancers (e.g., ALL, CLL) and myeloid-associated cancers (e.g., AML, MDS) informing the classification process. Remaining genes not meeting the above criteria were annotated according to published literature on myeloid and lymphoid distinction.(15) Genes lacking evidence of driving hematologic cancer based on these sources were categorized as “other cancer.”
Statistical Analysis
Demographic characteristics were described by CLL status (case/non-case) in both UKBB and PLCO. Cox proportional hazards regression assessed the association between incident CLL and CH. To identify the types of CH that confer the largest effects on incident CLL, we conducted analyses by subgroups of mCAs and CHIP. mCAs were evaluated in the following way: any autosomal mCA, autosomal mCA by individual chromosome, and in 5 Mb regions across the chromosome. Windows of 5 Mb were created starting at the location of the first genotyped variant on each chromosome. We did not evaluate the association with incident CLL for any 5 Mb window that contained fewer than 10 mCAs in CLL cases. The impact of CHIP on incident CLL was evaluated by aggregating mutations into any CHIP, myeloid CHIP, lymphoid CHIP, and CHIP relevant to both lymphoid and myeloid malignancies as described above. All analyses were conducted initially in UKBB with PLCO serving as a replication set when relevant data were available in PLCO (e.g., blood cell count data were available only in UKBB). Multivariable models adjusted for age, age-squared (to model non-linear effects of age), sex, smoking status (never, former, current), and genetic similarity to reference samples. For UKBB, follow-up began at study enrollment and continued until CLL diagnosis, date of death, last contact, or the latest cancer registry or inpatient record available in UKBB (March 31, 2017). For UKBB, follow-up time was calculated using the R package UKBBcleanR(33) version 0.1.2 that we developed for this study to facilitate calculations for time-to-event analysis and prevalent cancer removal (see: https://github.com/machiela-lab/UKBBcleanR for more documentation). For PLCO, follow-up began at time of genotype sample collection and continued until CLL diagnosis, date of death, last contact, or December 31, 2018.
The CLL PGS was constructed using all germline variants previously associated at genome-wide significance with CLL and their effect sizes as weights (Supplemental Table 1), using PLINK 1.9 (RRID:SCR_001757) with the “--score sum” flag for each participant in our datasets.(10) Variants had imputation R2 values greater than 0.3. We used ARCHIE, a sparse canonical correlation-based approach, to identify partitioned PGS potentially connected to distinct intermediate mechanisms.(34) Given a set of disease-related variants constituting the PGS of a disease, and their association summary statistics (effects, standard errors, and p-values) with a spectrum of molecular phenotypes (gene expression, proteins, and others), ARCHIE identifies subsets of variants that are strongly correlated with a subset of molecular targets. The selected subsets of variants can then be used to construct the corresponding partitioned PGS. Here, we extracted trans-pQTL summary statistics from the publicly available UK Biobank Proteomics data for the CLL-related proteins by removing all the proteins within a 5 Mb window of any variant in the PGS. Results from ARCHIE on these data can thus be interpreted as subsets of variants, or partitioned PGS of CLL, that are strongly trans-associated with the corresponding selected proteins.
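The score construction described above amounts to a weighted sum of risk-allele dosages, which a few lines of Python can make concrete. This is a sketch of the arithmetic only: the variant identifiers, weights, and dosages are hypothetical, the handling of missing variants (skipping them) is an assumption, and nothing here reproduces PLINK's internal behavior.

```python
from statistics import mean, pstdev


def polygenic_score(dosages, weights):
    """Unscaled weighted sum of allele dosages: sum of weight * dosage over
    variants present in both inputs (missing variants are skipped here)."""
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


def standardize(scores):
    """Rescale scores to mean 0 and SD 1 so effects can be read per SD."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical log-odds weights and dosages (0 to 2 copies of the risk allele).
weights = {"variant_a": 0.20, "variant_b": -0.10}
cohort = [
    {"variant_a": 2, "variant_b": 1},
    {"variant_a": 0, "variant_b": 2},
    {"variant_a": 1, "variant_b": 0},
]
raw = [polygenic_score(person, weights) for person in cohort]
print(raw)
print(standardize(raw))
```

Standardizing the raw sums is what allows associations to be expressed per one standard deviation increase in the score.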
Mediation analyses were conducted to quantify the effect mediated by types of CH on the association between the CLL PGS and CLL. In addition to the overall CLL PGS, each PGS component from the ARCHIE analyses was individually evaluated in mediation analyses to determine if they were differentially mediated by CH. Mediation analysis was conducted using the R package regmedint(35) version 1.0.0 based on the causal mediation framework developed by Valeri & VanderWeele (2013).(36) Mediation analysis models adjusted for age, age-squared, genetic similarity (as a proportion), smoking status, WBC, monocyte percentage, and neutrophil percentage. Specifically, we used Cox regression models to model the relationship between CLL risk and CLL PGS, CH, and other covariates, using the flag yreg = “survCox”. Depending on the mediator, logistic or linear regression was used to model its association with CLL PGS and other covariates, using the flag mreg = “linear” or “logistic”. Models also allowed for potential interactions between the direct and indirect effects with the flag “interaction = TRUE”.
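For orientation, a proportion mediated on the ratio scale can be computed from a direct-effect ratio and an indirect-effect ratio as sketched below. The expression used is a definition that is common in the causal mediation literature; whether the software's reported proportion uses exactly this form is an assumption here, and the input values are hypothetical.

```python
def proportion_mediated(ratio_direct, ratio_indirect):
    """Ratio-scale proportion mediated:
    direct * (indirect - 1) / (direct * indirect - 1),
    where direct * indirect is the total-effect ratio."""
    total = ratio_direct * ratio_indirect
    return ratio_direct * (ratio_indirect - 1.0) / (total - 1.0)


# Hypothetical ratios: a direct effect of 2.0 and an indirect effect of 1.5.
print(proportion_mediated(2.0, 1.5))
```

Under this definition an indirect ratio of exactly 1 gives a mediated proportion of zero, and small indirect ratios paired with large direct ratios give small proportions.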
To test the discriminative ability of autosomal mCAs, we developed sequential regression models to examine risk stratification of incident CLL. The baseline model consisted of standard demographic information (age, age-squared, sex, smoking status, and genetic similarity as a proportion). The CLL PGS was added next. Different combinations of autosomal mCAs were then added to the model to estimate the predictive value of these features. Autosomal mCAs were classified as any autosomal mCA (status: yes/no), maximum cellular fraction, and chromosome-specific mCA regions (status: yes/no). Additional models in UKBB were run including blood cell traits (WBC, monocyte percentage, and neutrophil percentage), CHIP mutations (status: yes/no), and CHIP type (myeloid, lymphoid, both). Blood cell traits and CHIP were not available in PLCO.
The performance of each model was assessed using area under the receiver operating characteristic curve (AUC) derived from the R package ‘survAUC’(37) version 1.2.0. UKBB data were subset to include individuals with available genotype imputation data and blood cell counts (N=422,029). UKBB was split (90%/10%) into training and test sets, and PLCO served as an external validation set. AUC calculations were performed to test model discriminative ability for both 5-year and 10-year CLL risk. 95% confidence intervals (95%CI) were estimated by performing 1000 iterations of testing/training resampling and using the 2.5% and 97.5% percentiles of the AUC distributions. Sensitivity analyses were performed to ensure the robustness of the results. First, UKBB was restricted to have the same age distribution and genetic similarity >80% to European reference samples to match PLCO. We also constructed ROC curves using 10-year risk models for visualization of model performance, although these models were not time-dependent and do not represent the survival AUCs. Multiple analyses were performed to remove potential preclinical disease. First, we performed AUC calculations for 5 to 10-year risk, removing all individuals diagnosed with CLL in the first five years. We also performed sensitivity analyses of AUC models using a 50%/50% training/testing split, removing potential preclinical disease by excluding individuals with high WBC (> 9.57 × 10^9 cells/L) based on the reference range quoted by the manufacturer for the UKBB measurement.(38) Additional sensitivity analyses were conducted removing common CLL mCA drivers (e.g., mCAs on chr 6, 11, 12, 13, 17). All statistical analyses were performed using a 64-bit build of R version 4.1.2 and two-sided significance levels were set at P < 0.05.
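The percentile-based confidence interval described above can be sketched in Python as follows. Note the simplifications: the AUC here is a plain rank-based (non-time-dependent) estimate rather than the survival AUC used in the study, the nearest-rank percentile rule is one of several possible conventions, and all names and numbers are hypothetical.

```python
def auc(scores, labels):
    """Probability that a randomly chosen case has a higher score than a
    randomly chosen non-case (ties count as one half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def percentile_interval(values, lower=2.5, upper=97.5):
    """Empirical percentile interval from resampled statistics, using a
    nearest-rank rule on the sorted values."""
    ordered = sorted(values)

    def pick(pct):
        return ordered[round(pct / 100 * (len(ordered) - 1))]

    return pick(lower), pick(upper)


# Hypothetical risk scores and case labels for a single test split.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))

# In practice `values` would hold one AUC per train/test resampling iteration.
print(percentile_interval([i / 1000 for i in range(1001)]))
```

Collecting one AUC per resampling iteration and taking the 2.5th and 97.5th percentiles yields an interval that need not be symmetric around the point estimate.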
Methods by Maas et al. (2016) were implemented to develop an empirical model for evaluating absolute risk of CLL.(39,40) The R package iCARE(40) version 1.26.0 was used to estimate the absolute risk of developing CLL among the testing set of UKBB and in PLCO. The model was developed by combining information on the age-specific population incidence rate of CLL among Non-Hispanic White individuals from the National Cancer Institute Surveillance, Epidemiology, and End Results (SEER) Program cancer registry (2000-2021) (RRID:SCR_006902)(41) and accounting for competing cause mortality from the CDC WONDER database (RRID:SCR_025830).(42) Based on the age range of the UKBB dataset, the cumulative absolute risk of CLL was modeled from 40-80 years of age. UKBB was split 90/10 into training and testing sets, and the age-adjusted HRs from the training set were used to calculate the absolute risk of CLL among individuals in the UKBB testing set and in PLCO as external validation. Absolute risk estimates from different models were compared to evaluate the change in absolute risk distribution when additional risk factors were incorporated. A sensitivity analysis in UKBB excluding high WBC and splitting the set into 70/30 training/testing sets was also performed. Blood cell traits and CHIP were also added to these models as they were available in UKBB. Risk distribution curves were overlaid for each model and the mean and median risks were summarized by each model for individuals who developed CLL during the study compared to those who did not.
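The general idea behind an absolute risk calculation with competing mortality can be sketched as a yearly discretization: in each year of age, the disease hazard (population incidence scaled by an individual's relative risk) is weighted by the probability of having survived both disease and other-cause death so far. This is a simplified illustration, not the iCARE algorithm; it omits steps a full implementation would need, such as calibrating the baseline hazard to a reference population, and every rate below is hypothetical.

```python
from math import exp


def cumulative_absolute_risk(incidence, mortality, relative_risk,
                             start_age, end_age):
    """Approximate risk of disease between start_age and end_age.
    `incidence` and `mortality` map each integer age to a yearly rate."""
    risk = 0.0
    cumulative_hazard = 0.0  # disease plus competing-cause hazard so far
    for age in range(start_age, end_age):
        disease_hazard = incidence[age] * relative_risk
        risk += disease_hazard * exp(-cumulative_hazard)
        cumulative_hazard += disease_hazard + mortality[age]
    return risk


# Hypothetical flat yearly rates from age 40 to 79, for illustration only.
incidence = {age: 0.0001 for age in range(40, 80)}
mortality = {age: 0.01 for age in range(40, 80)}
print(cumulative_absolute_risk(incidence, mortality, 1.0, 40, 80))
print(cumulative_absolute_risk(incidence, mortality, 5.0, 40, 80))
```

Comparing the output across relative risks shows how adding risk factors spreads individuals across the absolute risk distribution, which is the comparison the models above are designed to make.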
Data Availability
PLCO data used in this study are available for research purposes upon request and approval from the PLCO Cancer Data Access System (https://cdas.cancer.gov/plco/). UK Biobank genotype and phenotype data are available under controlled access by application to the UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research). All other raw data are available upon request from the corresponding author.
Results
The UKBB included 436,784 participants who were cancer-free at baseline, with an average age of 56.66 years. Of these participants, 53.40% were female, and the cohort was highly similar to European reference samples (mean proportion=0.96). There were 423 (0.10%) participants diagnosed with incident CLL in UKBB. The average follow-up time was 8.07 years (Median=8.13, SD=1.08, Range=0.01-11.40). CLL cases were generally older (Cases: Mean=61.91, SD=5.91; Controls: Mean=56.66, SD=8.11), more likely to be male (60.99% vs. 46.59%), and former smokers (41.37% vs. 34.02%), compared to non-cases (Supplemental Table 2).
The analytic population for PLCO consisted of 35,382 cancer-free individuals at baseline, with a mean age of 65 years. Of these participants, 55.36% were female and the cohort predominantly consisted of individuals highly similar to European reference samples (mean proportion=0.98). A total of 139 (0.39%) individuals were diagnosed with incident CLL in PLCO, with an average follow-up time of 14.24 years (Median=15.25, SD=4.39, Range=0.01-24.11). PLCO participants diagnosed with CLL were more likely to be male than non-cases (66.19% vs. 44.56%) (Supplemental Table 3).
Types of CH are strong predictors of CLL risk
mCAs
We estimated the effect of mCAs overall, by chromosome, and by 5 Mb windows across chromosomes on incident CLL in UKBB and PLCO. Adjusting for age, age-squared, sex, smoking status, and genetic similarity, any detectable autosomal mCA was associated with incident CLL (Hazard Ratio (HR)=25.53, 95% Confidence Interval (CI)=21.05-30.96, p-value=2.81×10−237) in UKBB. Autosomal mCAs were present in 48% of CLL cases and 3% of controls in UKBB. Chromosome-specific autosomal mCAs were more common in CLL cases across all chromosomes (Supplemental Table 4) and associated with increased CLL risk, with HRs ranging from 6.71 (95%CI=1.67-27.03, p-value=7.4×10−3) on chromosome 21 to 182.9 (95%CI=142.44-234.85, p-value<1.00×10−287) on chromosome 13 (Figure 1- Panel A, Supplemental Table 5). Only 5 Mb segments with at least 10 mCAs in cases or controls were evaluated (Supplemental Table 6). mCAs on 5 Mb segments of chromosomes 13, 12, 22, 14, and 11 had the highest HRs (Supplemental Figure 3, Supplemental Table 7). The strongest effect was observed on chromosome 13, in the 5 Mb window containing the known 13q14 deletion (chr13:48445913-53445913, GRCh38; HR=199.71, 95%CI=155.28-256.85, p-value<1.00×10−287). Chromosome-specific mCAs were especially common in CLL cases on chromosomes 11, 12, 13, 14, and 22, at 5.44%, 10.64%, 21.04%, 5.44%, and 4.96%, respectively, in UKBB (Supplemental Table 4, Supplemental Table 7).
Figure 1.

Adjusted association between incident chronic lymphocytic leukemia (CLL) and chromosome-specific mosaic chromosomal alterations (mCAs) in A) UK Biobank (UKBB) and B) Prostate, Lung, Colorectal and Ovarian Screening Trial (PLCO). Analyses are adjusted for age, age-squared, sex, smoking status (Never/Former/Current), and genetic similarity. Detailed results are given in Supplemental Table 4 for UKBB and Supplemental Table 8 for PLCO.
Replication in PLCO was observed for autosomal mCAs being associated with incident CLL (HR=8.92, 95%CI=6.19-12.86, p-value=9.67×10−32) based on 30% prevalence in CLL cases and 5% of controls. Due to limited sample size, chromosome and chromosomal region events could not be comprehensively analyzed in PLCO. However, high proportions of mCAs on chromosomes 12 and 13 were observed at 7.91% and 12.23%, respectively among CLL cases (Supplemental Table 8). The presence of an autosomal mCA on each evaluated chromosome was also associated with increased CLL risk, with HRs ranging from 4.77 (95%CI=0.66-34.29, p-value=0.12) on chromosome 22 to 50.76 (95%CI=30.44-85.77, p-value=9.36×10−49) on chromosome 13 (Figure 1-Panel B, Supplemental Table 9). Window-specific analyses were not conducted in PLCO due to the limited number of CLL cases with mCAs in each 5 Mb window.
CHIP
We evaluated the association of any detectable CHIP mutation, myeloid CHIP, lymphoid CHIP, and selected high-frequency CHIP genes with incident CLL. A detectable CHIP mutation was present in 25% of cases and 4% of controls in UKBB. After adjustment for age, age-squared, sex, smoking status and genetic similarity, any detectable CHIP mutation was associated with incident CLL (HR=5.66, 95%CI=4.53-7.06, p-value=2.54×10−52) in UKBB (Figure 2). Mutations classified as Lymphoid CHIP (excluding those classified as Lymphoid and Myeloid) were associated with the highest HR for CLL (HR=57.44, 95%CI=43.66-75.55, p-value=6.03×10−176). All other CHIP categories displayed a statistically significant increased risk for CLL except myeloid CHIP, with an HR of 1.49 (95%CI=0.99-2.24, p-value=0.05) (Figure 2, Supplemental Table 10). We estimated gene-specific risk among CH genes where at least 5 carriers developed CLL during the follow-up period (N=5 genes). CLL risk was associated with CHIP mutations in MYD88 (HR=120.24, 95%CI=70.35-205.50, p-value=1.41×10−58), NOTCH1 (HR=149.96, 95%CI=66.87-336.30, p-value=3.31×10−28), TP53 (HR=14.63, 95%CI=6.04-35.39, p-value=2.68×10−9), and XPO1 (HR=993.41, 95%CI=409.26-2411.32, p-value=2.61×10−13). An inverse but not statistically significant association was observed for DNMT3A (HR=0.67, 95%CI=0.33-1.35, p-value=0.26) (Supplemental Table 11). CHIP data were not available in PLCO for replication.
Figure 2.

Adjusted association between incident chronic lymphocytic leukemia (CLL) and clonal hematopoiesis of indeterminate potential (CHIP) overall and across different CHIP groups in the UK Biobank. Analyses are adjusted for age, age-squared, sex, smoking status (Never/Former/Current), and genetic similarity. Both CHIP includes driver genes that are not myeloid or lymphoid, such as TP53. Detailed results are given in Supplemental Table 9.
CLL germline susceptibility loci partitioning
The PGS for CLL was generated using significant CLL-associated germline genetic variants previously identified (Supplemental Table 1)(8) and was significantly associated with CLL in UK Biobank and PLCO participants (OR=1.93, 95%CI=1.76-2.12, p-value=1.08×10−42 and OR=1.98, 95%CI=1.68-2.34, p-value=4.59×10−16, respectively, per one standard deviation increase), with highly similar effect sizes (Supplemental Table 2 & 3).
We used ARCHIE to partition the PGS based on protein quantitative trait loci to evaluate potential components of the CLL PGS etiologically related to biologic mechanisms. ARCHIE partitioned the PGS into two sets of variants, called “clusters”, each with corresponding proteins strongly trans-associated with that subset (Supplemental Table 12). The first cluster (PGS1) included 23 variants, and its associated proteins were enriched in hematopoietic cell lineage (FDR=0.040), glycosaminoglycan biosynthesis (FDR=0.040), IgG binding (FDR=0.009), and cAMP response binding (FDR=0.019). The second cluster (PGS2) included 22 variants, and its associated proteins were overrepresented in the Ras signaling pathway (FDR=0.024), NF-kappa B signaling pathway (FDR=0.047), and inflammatory response (FDR=0.019), specifically through upregulation of genes in cells expressing MLL-AF4 (FDR=0.002) and down-regulation of genes affecting naive CD8 T cells and memory CD8 T cells (FDR=0.048).
mCAs mediate CLL germline susceptibility
We conducted mediation analyses of fifteen different risk factors with respect to the association between the CLL PGS and incident CLL. We identified two risk factors as significant mediators based on the following criteria: (1) passing a Bonferroni-corrected threshold (P-value<0.003) and (2) a proportion of mediation effects greater than 2% (Supplemental Table 13). Autosomal mCAs displayed the largest significant mediation between the PGS and CLL, with an indirect effect OR=1.019 (95%CI=1.012-1.025, p-value=2.95×10−8), accounting for 4.00% of the total effect (95%CI=2.53%-5.47%). Investigating the PGS partitioned components mentioned above, the mediation proportion for autosomal mCAs was 2.49% of the total effect (95%CI=0.61%-4.37%) for PGS1 and 4.68% (95%CI=2.81%-6.55%) for PGS2, with no significant heterogeneity detected between the mediation proportions for PGS1 and PGS2 (Phetero=0.613). The second mediator was neutrophil percentage, with an indirect effect OR=1.012 (95%CI=1.010-1.015, p-value=1.58×10−24), accounting for 2.55% of the total effect (95%CI=1.98%-3.12%). For PGS component analyses, the mediation proportion of neutrophil percentage was 1.70% (95%CI=0.97%-2.42%) for PGS1 and 2.76% (95%CI=2.03%-3.50%) for PGS2, with no significant heterogeneity observed between PGS1 and PGS2 mediation proportions (Phetero=0.218).
Incorporation of genetic information improves CLL risk stratification
We constructed sequential regression models to evaluate the added utility of germline and somatic factors to clinical and demographic factors for identifying individuals at high CLL risk. Models were fit for 5-year, 5 to 10-year, and 10-year risk and included CLL cases diagnosed within 5 years, 5 to 10 years, and 10 years, respectively, of blood collection. An initial 5-year risk model in the UKBB test set with demographic and clinical factors (age, age-squared, sex, smoking status, genetic similarity, blood cell traits) had a baseline AUC of 0.8372 (95%CI: 0.8257-0.9201) (Table 1). The AUC improved by adding germline genetics estimated using the CLL PGS (AUC=0.8617, 95%CI: 0.8466-0.9370). Incorporation of CH traits in addition to demographic and germline genetic factors resulted in AUCs of 0.8824 (95%CI: 0.8721-0.9516) for adding any autosomal mCA and 0.8770 (95%CI: 0.8575-0.9395) for separately adding any CHIP mutation. Models incorporating autosomal mCA cell fraction and mCAs on chromosome windows with the highest HRs were also fit, but these models had reduced discriminative ability relative to models with any autosomal mCA for UKBB (Models 2a-2h, Supplemental Table 14). Similarly, models including only specific CHIP types (e.g., lymphoid CHIP) had lower AUCs than models including any CHIP (Supplemental Table 15). The final model with demographic and clinical factors, CLL PGS, any autosomal mCA, and any CHIP mutation resulted in the highest discriminative ability (AUC=0.8864, 95%CI: 0.8739-0.9488). 5 to 10-year and 10-year risk models were also fit and followed a similar pattern, although AUCs were lower in 5 to 10-year and 10-year risk models relative to 5-year risk models (Table 1). Sensitivity analyses using sequential models and a binary endpoint (not time-dependent) were plotted on ROC curves for visualization and displayed maximum discriminative ability in the final model that included all germline and somatic variables (Supplemental Figure 4).
Sequential regression models were constructed by sex and displayed slightly higher AUCs among females compared to males upon addition of genetic variables in the 5-year risk models, although the pattern did not persist in the 10-year risk models (Supplemental Table 16).
Table 1.
Risk model area under the receiver operating curve (AUC) and 95% confidence interval (CI) for 5-year, 5 to 10-year, and 10-year risk of incident chronic lymphocytic leukemia (CLL) in the UK Biobank (UKBB) test set, sequentially adding genetic risk factors to the model. The models were constructed using a 90% training and 10% testing split.
| Model (UKBB test set) | 5-year risk | 5 to 10-year risk | 10-year risk |
|---|---|---|---|
| Model 0: Age, age-squared, sex, smoking status, genetic similarity, blood cell traits | 0.8372 (0.8257-0.9291) | 0.8210 (0.8070-0.8481) | 0.8391 (0.8205-0.8821) |
| Model 1: Model 0 + chronic lymphocytic leukemia (CLL) polygenic score (PGS) | 0.8617 (0.8466-0.9370) | 0.8373 (0.8233-0.8612) | 0.8504 (0.8385-0.8949) |
| Model 2: Model 1 + any autosomal mosaic chromosomal alteration (mCA) | 0.8824 (0.8721-0.9516) | 0.8605 (0.8462-0.8816) | 0.8786 (0.8604-0.9199) |
| Model 3: Model 1 + any clonal hematopoiesis of indeterminate potential (CHIP) mutation | 0.8770 (0.8575-0.9395) | 0.8456 (0.8316-0.8662) | 0.8646 (0.8482-0.8987) |
| Model 4: Model 1 + any CHIP + any autosomal mCA | 0.8864 (0.8739-0.9488) | 0.8644 (0.8499-0.8845) | 0.8848 (0.8626-0.9182) |
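To make the AUC comparison across nested models concrete, the following is a minimal Python sketch of the binary-endpoint AUC (the probability that a randomly chosen case receives a higher score than a randomly chosen control). It is not the time-dependent estimator used for Table 1, and the labels and scores are made-up toy values, not study data; the function name `auc` and the variable names are illustrative assumptions.

```python
def auc(scores, labels):
    """Binary-endpoint AUC: P(case score > control score), ties count as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example (hypothetical values): 3 cases followed by 5 controls.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
# Predicted risks from a hypothetical baseline model ...
base_scores = [0.30, 0.10, 0.25, 0.20, 0.05, 0.15, 0.12, 0.08]
# ... and from a hypothetical model with added predictors.
full_scores = [0.60, 0.22, 0.40, 0.20, 0.05, 0.15, 0.12, 0.08]

auc_base = auc(base_scores, labels)
auc_full = auc(full_scores, labels)
print(f"baseline AUC={auc_base:.2f}, full-model AUC={auc_full:.2f}")
```

In this toy example the added predictors lift one case above every control, which is the kind of change that shows up as a higher AUC when moving from Model 0 to Model 4.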
External validation of the CLL risk models was performed in PLCO, which is notable for older age, European ancestry only, and absence of blood cell trait and CHIP data. Consequently, the subsequent UKBB analysis was restricted to individuals with >80% genetic similarity to European reference samples and within the PLCO age range. In the restricted models, the 5-year risk models for UKBB and PLCO displayed similar AUCs for demographic and clinical characteristics (UKBB AUC=0.6361, PLCO AUC=0.6138), demographic/clinical characteristics + germline genetics (UKBB AUC=0.7334, PLCO AUC=0.7202), and demographic/clinical characteristics + germline genetics + any autosomal mCAs (UKBB AUC=0.8567, PLCO AUC=0.8608) (Supplemental Table 14). As in the UKBB, models with any autosomal mCA had a higher AUC than models with cell fraction or the highest-risk mCAs, and the 10-year risk model for PLCO showed some attenuation in discriminative ability relative to the 5-year risk model (Models 2a-2h, Supplemental Table 14).
After observing that predictive utility was higher in 5-year risk models than in 10-year risk models, we evaluated the adjusted HR of any autosomal mCA or any CHIP mutation over time. For mCAs, the HR was strongest in the first year following CH trait measurement and waned over time in both PLCO and UKBB (Supplemental Figure 5, Supplemental Table 17). For CHIP mutations, the pattern was less clear: there was some evidence of a decreasing HR from year 1 to year 5 (1yr HR=5.7, 5yr HR=4.12), but the effect was large (HR=5.80) and significant (P=1.32×10⁻⁵³) at year 10 (Supplemental Table 18).
Integrative models have utility in individuals lacking features of MBL
We performed sensitivity analyses in the UKBB dataset by removing individuals with high WBC (>9.57×10⁹ cells/L), as they may have MBL, a precursor state to CLL. Because of the smaller sample size, we divided the data into a 50/50 testing/training split. High AUCs persisted, and CH traits added information to our models (AUC=0.8572 for the full model) (Supplemental Table 19). Additional sensitivity analyses were conducted after removing individuals with mCAs frequently observed in MBL and CLL (e.g., mCAs on chr 6, 11, 12, 13, 17). Most individuals with mCAs known to be altered in CLL cases had already been removed by filtering for high WBC, so the results were similar to those obtained with WBC filtering alone (Supplemental Table 19). Similarly, no individuals with CLL diagnosed in the first six months after baseline remained after the total WBC and mCA filters. Our results indicate that integrative CLL models that incorporate genetic risk factors have improved discriminative ability even in individuals who lack features of MBL.
Integrative genetic models improve absolute risk distribution
We estimated the absolute risk the average 40-year-old individual would have of being diagnosed with CLL by age 80, using the age-specific incidence of CLL and accounting for competing risk of mortality and the effect and population frequency of known CLL SNPs. Individual estimates of the HR of sex, smoking, genetic similarity, and blood cell traits were incorporated into the absolute risk prediction models, adjusted for age and age-squared. Additional absolute risk prediction incorporated the CLL PGS, autosomal mCAs, and CHIP effect estimates. Visualizing these risk curves displays a widening risk distribution with each added variable in both UKBB (Figure 3-Panel A) and PLCO (Figure 3-Panel B). Initially, the average individual of age 40 had a 0.4% risk of CLL by age 80 (Supplemental Table 20). Mean absolute risk scores were higher among CLL cases compared with controls generally and the mean absolute risks for cases grew significantly as additional genetic information was incorporated. The final model, including all risk factors (demographics, blood cell traits, CLL PGS and CH) had a mean absolute risk of 21% in CLL cases compared to 0.4% in controls. Sensitivity analyses that removed individuals with high WBC displayed a similar pattern with added genetic information widening the risk distribution (Supplemental Figure 6) and individuals that developed CLL had higher absolute risks as additional genetic information was added to the models (Supplemental Table 21).
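The logic of an absolute risk estimate with a competing risk of death can be sketched as follows. This is a simplified Python illustration, not the software or the rates used in this study: the flat yearly incidence and mortality values are hypothetical placeholders, the function name `absolute_risk` is an assumption, and the sketch omits the step of calibrating the baseline hazard to the population distribution of risk factors.

```python
import math


def absolute_risk(incidence, mortality, relative_risk=1.0):
    """Cumulative disease risk over consecutive one-year age intervals.

    incidence: baseline disease hazard for each year of age (hypothetical here)
    mortality: competing all-cause mortality hazard for each year of age
    relative_risk: hazard ratio for one individual's risk-factor profile
    """
    if len(incidence) != len(mortality):
        raise ValueError("incidence and mortality must cover the same ages")
    event_free = 1.0  # probability of being alive and disease-free so far
    risk = 0.0
    for lam, mort in zip(incidence, mortality):
        hazard = lam * relative_risk
        total = hazard + mort
        if total == 0.0:
            continue
        # Probability that some event (disease or death) occurs this year,
        # split between disease and death in proportion to their hazards.
        p_any_event = 1.0 - math.exp(-total)
        risk += event_free * p_any_event * (hazard / total)
        event_free *= math.exp(-total)
    return risk


# Hypothetical flat rates for ages 40 through 79 (placeholders, not registry data).
ages = range(40, 80)
incidence = [1e-4 for _ in ages]
mortality = [1e-2 for _ in ages]

average_risk = absolute_risk(incidence, mortality)
high_risk = absolute_risk(incidence, mortality, relative_risk=5.0)
no_competing = absolute_risk(incidence, [0.0] * len(incidence))
print(f"average: {average_risk:.4f}, RR=5: {high_risk:.4f}")
```

Raising the relative risk (for example through a high PGS or the presence of CH) stretches an individual's estimate into the right tail, while competing mortality pulls every estimate below what the disease hazard alone would imply.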
Figure 3.

Overlaid density plots for the absolute risk of developing chronic lymphocytic leukemia (CLL) between ages 40 and 80 among A) UK Biobank (UKBB) and B) Prostate, Lung, Colorectal and Ovarian Screening Trial (PLCO) participants. Absolute risk was predicted using UKBB, age-specific incidence rates of CLL and competing risk of all-cause mortality. Different models were fit for absolute risk prediction. Plots were truncated to risks below 2% to visualize the primary peak of the distribution. Detailed absolute risk summaries by risk model are given in Supplemental Table 20.
Among individuals with normal blood cell counts, controls were 1:1 age-matched to individuals who developed CLL in the UKBB testing set, and absolute risks were categorized into <1%, 1%-5%, 5%-10%, and >10% across each of three models (Model 1: demographic risk factors, Model 2: Model 1 + CLL PGS, Model 3: Model 2 + any autosomal mCA + any CHIP mutation). Individual absolute risk estimates across models were plotted to show model discrimination and individual movement across absolute risk categories by incident CLL status (Figure 4). As genetic information was incorporated into the models, larger fractions of those who developed CLL were classified into higher risk categories, and the proportion of controls in lower risk categories increased (Supplemental Table 22). For example, Model 1, containing only demographic risk factors, had 60% of CLL cases at <1% absolute risk and no CLL cases with greater than 5% risk; however, Model 3, including the CLL PGS and CH, had 41% of CLL cases in the <1% absolute risk category and 21% of cases in the >10% category. While the matched control population is restricted for ease of visualization, similar relationships are observed in the entire control population (Supplemental Table 22). The calibration plot (Supplemental Figure 7) demonstrates a close alignment between observed and predicted risks across risk deciles, with points clustering near the line representing perfect calibration.
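The movement of individuals across absolute risk categories between models can be tabulated with a short Python sketch like the one below. The risk values are invented for illustration, the function names are assumptions, and the handling of values that fall exactly on a category boundary (5% assigned to 5%-10%, 10% assigned to 5%-10%) is an assumption not stated in the text.

```python
from collections import Counter


def risk_category(risk):
    """Map an absolute risk (as a proportion) to the categories used in the text."""
    if risk < 0.01:
        return "<1%"
    if risk < 0.05:
        return "1%-5%"
    if risk <= 0.10:  # boundary handling is an assumption
        return "5%-10%"
    return ">10%"


def reclassification(risks_before, risks_after):
    """Count individuals by (category under first model, category under second model)."""
    if len(risks_before) != len(risks_after):
        raise ValueError("both models must score the same individuals")
    return Counter(
        (risk_category(before), risk_category(after))
        for before, after in zip(risks_before, risks_after)
    )


# Hypothetical absolute risks for four individuals who developed CLL.
model1_risks = [0.004, 0.008, 0.02, 0.03]  # demographic factors only
model3_risks = [0.003, 0.12, 0.07, 0.21]   # + PGS + CH (toy values)

table = reclassification(model1_risks, model3_risks)
for (before, after), count in sorted(table.items()):
    print(f"{before:>7} -> {after:<7} n={count}")
```

Applied separately to cases and controls, a table of this form summarizes how many cases move up and how many controls move down as predictors are added.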
Figure 4.

Individual estimated absolute risk of developing chronic lymphocytic leukemia (CLL) between ages 40 and 80 by incident CLL status across three risk models: 1) Model 1: Demographic factors and blood cell traits, 2) Model 2: Model 1 + CLL Polygenic Score (PGS), 3) Model 3: Model 2 + Any clonal hematopoiesis of indeterminate potential (CHIP) mutation + Any autosomal mosaic chromosomal alteration (mCA), in the UK Biobank (UKBB) testing set, including only individuals with normal total white blood cell counts (WBC <9.57×10⁹ cells/L). Data summary is shown in Supplemental Table 22.
Discussion
We investigated the contributions of demographic risk factors, germline genetic risk, and somatically acquired CH to risk of incident CLL in 436,784 cancer-free participants in the UKBB and 35,382 from PLCO. The integration of genetic information, including a germline CLL-associated PGS, mCAs, and CHIP, with standard clinical information enhanced the ability to identify individuals at high risk of CLL. CH showed persistent discriminative ability after removing individuals with high WBC, known CLL-mCAs, and early disease diagnoses, indicating that the relationship between CH and CLL is not entirely due to undiagnosed MBL or CLL. We found that mCAs detected on any autosomal chromosome were strongly associated with CLL risk, with improved discriminative ability when utilizing models with any autosomal mCA relative to mCAs in specific high-risk regions. For CHIP, we observed that including lymphoid and myeloid CHIP yielded higher AUCs than lymphoid CHIP alone. We also observed that mCA cell fraction did not improve discriminative ability over the presence or absence of an autosomal mCA. This was counter to expectation, as we anticipated that information on higher frequency clones would improve discriminative ability, and it suggests that the presence or absence of mCAs and CHIP could be more useful for CLL discrimination than the fraction of mutated cells or the location of CH events.
A PGS was generated using 45 germline susceptibility variants associated with CLL risk(29,43,44) and a PGS partitioning framework was employed,(34) which identified two groups of CLL-associated variants associated with sets of distal genes highlighting associations with inflammatory pathways and hematopoietic stem cell lineage. We identified autosomal mCAs and neutrophil percentage as potential mediators of the effect of the CLL PGS on incident CLL, indicating some germline risk of CLL could act through somatically acquired CH. There is also a germline genetic component to CH and blood cell traits,(45-48) potentially offering further insights into germline mechanisms relevant to CLL risk, such as immune or inflammation pathways, apoptosis, and telomere length.
We observed greater discrimination with added genetic information in our absolute risk prediction models. Sequential models incorporating more genetic information moved the peak of the CLL absolute risk distribution closer to zero absolute risk while the right tail of the distribution became longer. After stratification of individuals by incident CLL, we note that the mean and median absolute risk among the case population increases with each additional trait added to the model. The integrative absolute risk models generally assigned lower risk values to individuals who did not develop CLL during the study period, though some still had high estimated absolute risk. It remains unclear whether these individuals would develop CLL with longer follow-up. As CLL is often a slow progressing disease and asymptomatic in the early stages, capturing individuals at high risk may provide a powerful opportunity for developing interventions to prevent progression to malignancy. The effect of mCAs waned over time in both UKBB and PLCO study populations, indicating that longitudinal measurements of CH are likely important for accurate CLL risk assessment.
Our study leveraged two large, well-characterized cohorts, allowing for the investigation of incident CLL, incorporation of both germline and somatic genetic information, and adjustment for known confounders. Exclusion of prevalent cancers reduced the potential for reverse causation and minimized potential confounding from cancer treatments.(32,49) While CLL is known to be more common in European populations,(2,50) the lack of diversity in our study population is a limitation, as we were unable to evaluate model discriminative ability in individuals of non-European genetic backgrounds. The use of CLL incidence rates and competing risk of mortality from U.S. resources in absolute risk models for UKBB is also a limitation. Because the datasets we used lacked clinical MBL data, we were unable to fully remove MBL cases from our analytic set. However, sensitivity analyses removing individuals with high WBC, CLL diagnosis in the first 5 years, and mCAs commonly observed in MBL cases showed stable added discriminative ability of the incorporated genetic information.
Our investigation provides initial insights into the improved CLL risk stratification of integrative models that include germline and somatic genetic information. Given the slow progression of CLL in some patients, future research should explore how germline genetics and CH can be harnessed for improved etiologic knowledge of CLL carcinogenesis and clinical utility, as we do not recommend the integrative genetic models presented here be used clinically for individual CLL risk prediction. A more complete understanding of the environmental, germline, and somatic etiologic forces that shape the progression of normal cells down a trajectory toward CLL will be valuable for identifying targeted interventions to prevent CLL.
Acknowledgments:
This work was conducted using data from the UK Biobank (application # 55288 and 21552). The UK Biobank was established by the Wellcome Trust, the Medical Research Council, the United Kingdom Department of Health, and the Scottish Government. The UK Biobank has also received funding from the Welsh Assembly Government, the British Heart Foundation, and Diabetes UK. This work uses data from PLCO. PLCO is supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics, and contracts from the Division of Cancer Prevention, NCI, NIH, DHHS. The authors thank the NCI study management team, the screening center investigators, and staff at Information Management Services, Inc. and Westat, Inc. Most importantly, we thank the study participants for their contributions that made this study possible. Cancer incidence data have been provided by the Alabama Statewide Cancer Registry, Arizona Cancer Registry, Colorado Central Cancer Registry, District of Columbia Cancer Registry, Georgia Cancer Registry, Hawaii Cancer Registry, Cancer Data Registry of Idaho, Maryland Cancer Registry, Michigan Cancer Surveillance Program, Minnesota Cancer Surveillance System, Missouri Cancer Registry, Nevada Central Cancer Registry, Ohio Cancer Incidence Surveillance System, Pennsylvania Cancer Registry, Texas Cancer Registry, Utah Cancer Registry, Virginia Cancer Registry, and Wisconsin Cancer Reporting System. All are supported in part by funds from the Center for Disease Control and Prevention, National Program for Central Registries, local states or by the National Cancer Institute, Surveillance, Epidemiology, and End Results program. The results reported here, and the conclusions derived are the sole responsibility of the authors. The opinions expressed by the authors are their own and this material should not be interpreted as representing the official viewpoint of the U.S. Department of Health and Human Services, the National Institutes of Health or the National Cancer Institute.
Financial support:
This work was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health (A.K.H., D.W.B., W.Z.D., A.D., S.S., S.H.L., B.B., I.D.B., W.Y.H., N.D.F., H.Z., D.D., S.J.C., M.J.M.).
Abbreviations:
- CLL: Chronic Lymphocytic Leukemia
- CH: Clonal Hematopoiesis
- CHIP: Clonal Hematopoiesis of Indeterminate Potential
- mCA: Mosaic Chromosomal Alteration
- UKBB: UK Biobank
- PLCO: Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial
- PGS: Polygenic Score
- HR: Hazard Ratio
- CI: Confidence Interval
- AUC: Area Under the Receiver Operating Curve
- MBL: Monoclonal B-cell Lymphocytosis
- FDR: False Discovery Rate
Footnotes
Conflicts of Interest Statement: Dr. Bolton reports research funding from Servier and serving on their medical advisory board. No other authors have potential conflicts or additional funding sources to disclose.
References
- 1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2022. CA Cancer J Clin. 2022;72:7–33.
- 2. Ou Y, Long Y, Ji L, Zhan Y, Qiao T, Wang X, et al. Trends in Disease Burden of Chronic Lymphocytic Leukemia at the Global, Regional, and National Levels From 1990 to 2019, and Projections Until 2030: A Population-Based Epidemiologic Study. Front Oncol. 2022;12:840616.
- 3. Goldin LR, Pfeiffer RM, Li X, Hemminki K. Familial risk of lymphoproliferative tumors in families of patients with chronic lymphocytic leukemia: results from the Swedish Family-Cancer Database. Blood. 2004;104:1850–4.
- 4. Di Bernardo MC, Crowther-Swanepoel D, Broderick P, Webb E, Sellick G, Wild R, et al. A genome-wide association study identifies six susceptibility loci for chronic lymphocytic leukemia. Nat Genet. 2008;40:1204–10.
- 5. Slager SL, Rabe KG, Achenbach SJ, Vachon CM, Goldin LR, Strom SS, et al. Genome-wide association study identifies a novel susceptibility locus at 6p21.3 among familial CLL. Blood. 2011;117:1911–6.
- 6. Slager SL, Caporaso NE, de Sanjose S, Goldin LR. Genetic susceptibility to chronic lymphocytic leukemia. Semin Hematol. 2013;50:296–302.
- 7. Berndt SI, Skibola CF, Joseph V, Camp NJ, Nieters A, Wang Z, et al. Genome-wide association study identifies multiple risk loci for chronic lymphocytic leukemia. Nat Genet. 2013;45:868–76.
- 8. Sava GP, Speedy HE, Di Bernardo MC, Dyer MJS, Holroyd A, Sunter NJ, et al. Common variation at 12q24.13 (OAS3) influences chronic lymphocytic leukemia risk. Leukemia. 2015;29:748–51.
- 9. Berndt SI, Camp NJ, Skibola CF, Vijai J, Wang Z, Gu J, et al. Meta-analysis of genome-wide association studies discovers multiple loci for chronic lymphocytic leukemia. Nat Commun. 2016;7:10933.
- 10. Law PJ, Berndt SI, Speedy HE, Camp NJ, Sava GP, Skibola CF, et al. Genome-wide association analysis implicates dysregulation of immunity genes in chronic lymphocytic leukaemia. Nat Commun. 2017;8:14175.
- 11. Speedy HE, Di Bernardo MC, Sava GP, Dyer MJS, Holroyd A, Wang Y, et al. A genome-wide association study identifies multiple susceptibility loci for chronic lymphocytic leukemia. Nat Genet. 2014;46:56–60.
- 12. Döhner H, Stilgenbauer S, James MR, Benner A, Weilguni T, Bentz M, et al. 11q deletions identify a new subset of B-cell chronic lymphocytic leukemia characterized by extensive nodal involvement and inferior prognosis. Blood. 1997;89:2516–22.
- 13. Zenz T, Vollmer D, Trbusek M, Smardova J, Benner A, Soussi T, et al. TP53 mutation profile in chronic lymphocytic leukemia: evidence for a disease specific profile from a comprehensive analysis of 268 mutations. Leukemia. 2010;24:2072–9.
- 14. Gaidano G, Rossi D. The mutational landscape of chronic lymphocytic leukemia and its impact on prognosis and treatment. Hematol Am Soc Hematol Educ Program. 2017;2017:329–37.
- 15. Niroula A, Sekar A, Murakami MA, Trinder M, Agrawal M, Wong WJ, et al. Distinction of lymphoid and myeloid clonal hematopoiesis. Nat Med. 2021;27:1921–7.
- 16. Quesada V, Conde L, Villamor N, Ordóñez GR, Jares P, Bassaganyas L, et al. Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nat Genet. 2011;44:47–52.
- 17. Machiela MJ, Zhou W, Caporaso N, Dean M, Gapstur SM, Goldin L, et al. Mosaic 13q14 deletions in peripheral leukocytes of non-hematologic cancer cases and healthy controls. J Hum Genet. 2016;61:411–8.
- 18. Landau DA, Tausch E, Taylor-Weiner AN, Stewart C, Reiter JG, Bahlo J, et al. Mutations driving CLL and their evolution in progression and relapse. Nature. 2015;526:525–30.
- 19. Puente XS, Pinyol M, Quesada V, Conde L, Ordóñez GR, Villamor N, et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature. 2011;475:101–5.
- 20. Puente XS, Beà S, Valdés-Mas R, Villamor N, Gutiérrez-Abril J, Martín-Subero JI, et al. Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature. 2015;526:519–24.
- 21. Baliakas P, Hadzidimitriou A, Sutton L-A, Rossi D, Minga E, Villamor N, et al. Recurrent mutations refine prognosis in chronic lymphocytic leukemia. Leukemia. 2015;29:329–36.
- 22. Fabbri G, Rasi S, Rossi D, Trifonov V, Khiabanian H, Ma J, et al. Analysis of the chronic lymphocytic leukemia coding genome: role of NOTCH1 mutational activation. J Exp Med. 2011;208:1389–401.
- 23. D’Agaro T, Bittolo T, Bravin V, Dal Bo M, Pozzo F, Bulian P, et al. NOTCH1 mutational status in chronic lymphocytic leukaemia: clinical relevance of subclonal mutations and mutation types. Br J Haematol. 2018;182:597–602.
- 24. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779.
- 25. Prorok PC, Andriole GL, Bresalier RS, Buys SS, Chia D, David Crawford E, et al. Design of the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial. Control Clin Trials. 2000;21:273S–309S.
- 26. Machiela MJ, Huang W-Y, Wong W, Berndt SI, Sampson J, De Almeida J, et al. GWAS Explorer: an open-source tool to explore, visualize, and access GWAS summary statistics in the PLCO Atlas. Sci Data. 2023;10:25.
- 27. Jin Y, Schaffer AA, Feolo M, Holmes JB, Kattman BL. GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis. G3 Bethesda Md. 2019;9:2447–61.
- 28. Brown DW, Cato LD, Zhao Y, Nandakumar SK, Bao EL, Rehling T, et al. Shared and distinct genetic etiologies for different types of clonal hematopoiesis. Nat Commun. 2023;14:5536.
- 29. Loh P-R, Genovese G, Handsaker RE, Finucane HK, Reshef YA, Palamara PF, et al. Insights about clonal hematopoiesis from 8,342 mosaic chromosomal alterations. Nature. 2018;559:350–5.
- 30. Loh P-R, Genovese G, McCarroll SA. Monogenic and polygenic inheritance become instruments for clonal selection. Nature. 2020;584:136–41.
- 31. Bick AG, Weinstock JS, Nandakumar SK, Fulco CP, Bao EL, Zekavat SM, et al. Inherited causes of clonal haematopoiesis in 97,691 whole genomes. Nature. 2020;586:763–8.
- 32. Bolton KL, Ptashkin RN, Gao T, Braunstein L, Devlin SM, Kelly D, et al. Cancer therapy shapes the fitness landscape of clonal hematopoiesis. Nat Genet. 2020;52:1219–26.
- 33. Depaulis A, Brown DW, Hubbard AK. UKBBcleanR: Prepare electronic medical record data from the UK Biobank for time-to-event analyses. 2022. Available from: https://github.com/machiela-lab/UKBBcleanR
- 34. Dutta D, He Y, Saha A, Arvanitis M, Battle A, Chatterjee N. Aggregative trans-eQTL analysis detects trait-specific target gene sets in whole blood. Nat Commun. 2022;13:4323.
- 35. Yoshida K, Li Y. regmedint: Regression-Based Causal Mediation Analysis with Interaction and Effect Modification Terms. 2020. page 1.0.1. Available from: https://CRAN.R-project.org/package=regmedint
- 36. Valeri L, VanderWeele TJ. Mediation analysis allowing for exposure–mediator interactions and causal interpretation: Theoretical assumptions and implementation with SAS and SPSS macros. Psychol Methods. 2013;18:137.
- 37. Potapov S, Adler W, Schmid M. survAUC: Estimators of Prediction Accuracy for Time-to-Event Data. 2023. Available from: https://cran.r-project.org/web/packages/survAUC/index.html
- 38. Sheard SM, Nicholls R, Froggatt J. UK Biobank Haematology Data Companion Document. UK Biobank; 2017. Available from: https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/haematology.pdf
- 39. Maas P, Barrdahl M, Joshi AD, Auer PL, Gaudet MM, Milne RL, et al. Breast Cancer Risk From Modifiable and Nonmodifiable Risk Factors Among White Women in the United States. JAMA Oncol. 2016;2:1295–302.
- 40. Choudhury PP, Maas P, Wilcox A, Wheeler W, Brook M, Check D, et al. iCARE: An R package to build, validate and apply absolute risk models. PLOS ONE. 2020;15:e0228198.
- 41. Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) SEER*Stat Database. Incidence - SEER Research Data, 17 Registries, Nov 2023 Sub (2000-2021) - linked to County Attributes - Time Dependent (1990-2022) Income/Rurality, 1969-2022 Counties.
- 42. National Center for Health Statistics (NCHS). Underlying Cause of Death 1999-2020 on CDC WONDER Online Database. Released in 2021. Vital Statistics Cooperative Program. Available from: http://wonder.cdc.gov/ucd-icd10.html
- 43. Kandaswamy R, Sava GP, Speedy HE, Beà S, Martín-Subero JI, Studd JB, et al. Genetic Predisposition to Chronic Lymphocytic Leukemia Is Mediated by a BMF Super-Enhancer Polymorphism. Cell Rep. 2016;16:2061–7.
- 44. Yan H, Tian S, Kleinstern G, Wang Z, Lee J-H, Boddicker NJ, et al. Chronic lymphocytic leukemia (CLL) risk is mediated by multiple enhancer variants within CLL risk loci. Hum Mol Genet. 2020;29:2761–74.
- 45. Silver AJ, Bick AG, Savona MR. Germline risk of clonal haematopoiesis. Nat Rev Genet. 2021;22:603–17.
- 46. Zekavat SM, Lin S-H, Bick AG, Liu A, Paruchuri K, Wang C, et al. Hematopoietic mosaic chromosomal alterations increase the risk for diverse types of infection. Nat Med. 2021;27:1012–24.
- 47. Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, Jiang T, et al. The Polygenic and Monogenic Basis of Blood Traits and Diseases. Cell. 2020;182:1214–1231.e11.
- 48. Astle WJ, Elding H, Jiang T, Allen D, Ruklisa D, Mann AL, et al. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell. 2016;167:1415–1429.e19.
- 49. Hubbard AK, Brown DW, Machiela MJ. Clonal hematopoiesis due to mosaic chromosomal alterations: impact on disease risk and mortality. Leuk Res. 2023;126:107022.
- 50. Yang S, Varghese AM, Sood N, Chiattone C, Akinola NO, Huang X, et al. Ethnic and geographic diversity of chronic lymphocytic leukaemia. Leukemia. 2021;35:433–9.
Data Availability Statement
PLCO data used in this study are available for research purposes upon request and approval through the PLCO Cancer Data Access System (https://cdas.cancer.gov/plco/). UK Biobank genotype and phenotype data are available under controlled access through the UK Biobank (https://www.ukbiobank.ac.uk/enable-your-research). All other raw data are available upon request from the corresponding author.
