Abstract
Although multiple common susceptibility loci for lung cancer (LC) have been identified by genome-wide association studies, they explain only a small portion of heritability. The etiological contribution of rare deleterious variants (RDVs) to LC risk is not fully characterized and may account for part of the missing heritability. Here, we sequenced the whole exomes of 2777 participants from the Environment and Genetics in Lung cancer Etiology study, a homogeneous population including 1461 LC cases and 1316 controls. In single-variant analyses, we identified a new RDV, rs77187983 [EHBP1, odds ratio (OR) = 3.13, 95% confidence interval (CI) = 1.34–7.30, P = 0.008] and replicated two previously reported RDVs, rs11571833 (BRCA2, OR = 2.18, 95% CI = 1.25–3.81, P = 0.006) and rs752672077 (MPZL2, OR = 3.70, 95% CI = 1.04–13.15, P = 0.044). In gene-based analyses, we confirmed BRCA2 (P = 0.007) and ATM (P = 0.014) associations with LC risk and identified TRIB3 (P = 0.009), involved in maintaining genome stability and DNA repair, as a new candidate susceptibility gene. Furthermore, cases were enriched with RDVs in homologous recombination repair [carrier frequency (CF) = 22.9% versus 19.5%, P = 0.017] and Fanconi anemia (CF = 12.5% versus 10.2%, P = 0.036) pathways. Our results were not significant after multiple testing correction, but the identified variants and genes were enriched in cases versus controls in large-scale public biobank resources, including The Cancer Genome Atlas, FinnGen and UK Biobank. Our study identifies novel candidate genes and highlights the importance of RDVs in DNA repair-related genes for LC susceptibility. These findings improve our understanding of LC heritability and may contribute to the development of risk stratification and prevention strategies.
Introduction
Lung cancer (LC) remains the leading cause of cancer mortality worldwide (1). Although ~81.7% of LCs have been reported to be related to tobacco smoking, genetic factors play a crucial role in its pathogenesis (2). During the past decade, multiple common susceptibility loci have been identified in LC through genome-wide association studies (GWAS) (3). While occurring at relatively high frequency in populations (minor allele frequency ≥5%), these variants together account for only 12.3% of familial heritability in European populations, and most of them are located in non-coding regions with unclear functions (3). Rare germline variants, especially those in gene coding regions with potential adverse consequences, may explain at least part of the remaining heritability.
Rare deleterious variants (RDVs) have been reported to be associated with the risk of various cancer types (4–7). However, the RDVs identified in LC are poorly replicated across whole-exome sequencing (WES) studies with mixed populations of European descent (mix-EU), implying that some of them might be population-specific (8–11). Although the lack of reproducibility could be because of other factors, large studies focusing on a homogeneous population may help to identify novel disease-causing RDVs that contribute to LC susceptibility. The Environment and Genetics in Lung cancer Etiology (EAGLE) study offers such a possibility. As one of the largest population-based LC case–control studies in the world, EAGLE aimed to comprehensively explore the full spectrum of LC etiology in the Lombardy region of Italy by integrating epidemiological, molecular and clinical resources (12).
Results
Single-variant association test
Thirty risk RDV candidates were detected through single-variant association test with P-value <0.05 (Table 1). Among them, rs11571833 (p.K3326X), a previously reported truncating variant in the core DNA repair gene BRCA2, showed the strongest association [odds ratio (OR) = 2.18, 95% confidence interval (CI): 1.25–3.81, P = 0.006] with overall LC risk. Its effect size is broadly consistent with the former findings from several independent studies of overall LC (OR = 2.36, mix-EU), squamous LC (OR = 2.47, mix-EU) and small-cell LC (OR = 2.06, Icelandic population) (9,13,14). We also replicated a frameshift deletion, rs752672077 (p.I24M, MPZL2, OR = 3.70, 95% CI: 1.04–13.15, P = 0.044) recently identified by the Transdisciplinary Research into Cancer of the Lung (TRICL) WES study (10), with similar effect size (OR = 3.88, mix-EU). Both variants in BRCA2 and MPZL2 were statistically significant in the original studies (9,10,13,14); thus, our findings strongly support these previous observations.
Table 1.
Candidate RDVs for LC risk identified in the EAGLE study
Category | Variant ID | Gene | REF | ALT | EAF (a), Case/Control | Overall LC: OR (95% CI), P-value (b) | LUAD: OR (95% CI), P-value (b) | LUSC: OR (95% CI), P-value (b) | Public EAF, Case/Control |
---|---|---|---|---|---|---|---|---|---|
Previously reported | rs11571833 | BRCA2 (p.K3326X) | A | T | 0.0144/0.0068 | 2.18 (1.25–3.81) 0.006 | 1.73 (0.87–3.44) 0.120 | 1.71 (0.77–3.78) 0.189 | 0.0208/0.0084 |
Previously reported | rs752672077 | MPZL2 (p.I24M) | CT | C | 0.0041/0.0011 | 3.70 (1.04–13.15) 0.044 | 3.53 (0.86–14.55) 0.081 | 2.66 (0.53–13.34) 0.236 | 0.0027/0.0012 |
Newly identified (EAF_TCGA > EAF_gnomAD) | rs77187983 | EHBP1 (p.D590V) | A | T | 0.0082/0.0027 | 3.13 (1.34–7.30) 0.008 | 3.73 (1.45–9.59) 0.006 | 3.04 (1.08–8.56) 0.035 | 0.0093/0.0061 |
Newly identified (EAF_TCGA > EAF_gnomAD) | rs61735090 | PTN (p.E151T) | C | A | 0.0185/0.0110 | 1.71 (1.08–2.71) 0.022 | 1.81 (3.13–2.11) 0.035 | 1.76 (0.95–3.25) 0.071 | 0.0099/0.0070 |
Newly identified (EAF_TCGA > EAF_gnomAD) | rs74596180 | LMO7 (p.G269R) | G | C | 0.0058/0.0019 | 3.17 (1.17–8.64) 0.024 | 2.76 (0.86–8.85) 0.088 | 4.86 (1.56–15.19) 0.006 | 0.0027/0.0014 |
Newly identified (EAF_TCGA > EAF_gnomAD) | rs141941152 | OR4C15 (p.L240F) | C | T | 0.0058/0.0019 | 3.04 (1.12–8.28) 0.030 | 4.94 (1.74–14.09) 0.003 | 1.13 (0.20–6.22) 0.89 | 0.0093/0.0061 |
Newly identified (EAF_TCGA > EAF_gnomAD) | rs142175729 | SLC5A8 (p.L337F) | G | A | 0.0051/0.0019 | 2.83 (1.02–7.82) 0.045 | 3.17 (1.02–9.84) 0.046 | 3.81 (1.11–13.05) 0.033 | 0.0082/0.0076 |
Newly identified (EAF_TCGA > EAF_gnomAD) | rs17153879 | FANK1 (p.P12L) | C | T | 0.0034/0.0008 | 4.65 (1.01–21.37) 0.048 | 1.99 (0.27–14.83) 0.501 | 2.18 (0.30–15.78) 0.439 | 0.0082/0.0059 |
Newly identified (not found in TCGA or gnomAD) | rs201122874 | WDR27 (p.R349X) | G | A | 0.0072/0.0023 | 3.17 (1.27–7.90) 0.013 | 2.32 (0.77–6.99) 0.135 | 4.15 (1.44–11.94) 0.008 | NA/0.0012 |
Newly identified (not found in TCGA or gnomAD) | rs201679838 | FBXO27 (p.H255P) | T | G | 0.0051/0.0011 | 4.51 (1.30–15.67) 0.018 | 5.04 (1.32–19.3) 0.018 | 3.91 (0.8–19.04) 0.091 | NA/0.0011 |
Newly identified (not found in TCGA or gnomAD) | rs139794951 | OR52I2 (NA) | TGAGTATG | T | 0.0058/0.0015 | 4.03 (1.35–12.07) 0.013 | 4.19 (1.21–14.54) 0.024 | 5.19 (1.46–18.44) 0.011 | 0.0038/NA |
Newly identified (not found in TCGA or gnomAD) | rs757387484 | C5orf45 (p.S162R) | CCTGCACGCCACGG | C | 0.0041/0.0008 | 5.34 (1.19–24.00) 0.029 | 3.93 (0.71–21.81) 0.117 | 5.33 (0.95–30.01) 0.058 | NA/0.0006 |
Newly identified (not found in TCGA or gnomAD) | rs139552233 | ATM (p.S978P) | T | C | 0.0038/0.0008 | 4.81 (1.06–21.76) 0.042 | 6.02 (1.19–30.37) 0.030 | 2.93 (0.39–22.13) 0.298 | NA/0.0007 |
Newly identified (not found in TCGA or gnomAD) | rs200745939 | KIAA1683 (p.Q1002X) | G | A | 0.0031/0.0004 | 7.95 (1.00–63.01) 0.050 | 10.87 (1.29–91.76) 0.028 | 11.15 (1.13–110.4) 0.039 | NA/0.0004 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs200132735 | SEC16B (p.G523E) | C | T | 0.0065/0.0015 | 4.46 (1.51–13.17) 0.007 | 6.55 (2.07–20.7) 0.001 | 2.39 (0.51–11.31) 0.271 | 0.0006/0.0030 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs61756067 | NUP153 (p.P478L) | G | A | 0.0099/0.0046 | 2.22 (1.13–4.38) 0.021 | 2.42 (1.1–5.31) 0.027 | 1.48 (0.57–3.87) 0.422 | 0.0082/0.0093 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs147288996 | HARS (p.G205D) | C | T | 0.0041/0.0004 | 11.02 (1.43–85.01) 0.021 | NA in LUAD | 23.52 (2.95–187.6) 0.003 | 0.0016/0.0022 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs35139099 | NFXL1 (p.R432C) | G | A | 0.0068/0.0023 | 2.88 (1.16–7.12) 0.022 | 3.07 (1.11–8.45) 0.030 | 2.79 (0.92–8.5) 0.070 | 0.0038/0.0041 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs527905870 | GFRAL (p.I339N) | C | CA | 0.0048/0.0011 | 4.28 (1.23–14.94) 0.023 | 6.34 (1.7–23.65) 0.006 | 3.91 (0.85–18.08) 0.081 | 0.0011/0.0014 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs117318472 | UNC45A (p.W138X) | C | T | 0.0068/0.0027 | 2.71 (1.14–6.44) 0.024 | 2.28 (0.82–6.39) 0.116 | 4.36 (1.57–12.08) 0.005 | 0.0049/0.0054 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs140886939 | PSMD6 (p.G341R) | C | T | 0.0041/0.0008 | 5.61 (1.25–25.19) 0.024 | 9.05 (1.91–43.0) 0.006 | 2.64 (0.36–19.18) 0.338 | 0.0016/0.0018 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs144648271 | KRTAP27-1 (p.Q86X) | G | A | 0.0068/0.0027 | 2.65 (1.11–6.30) 0.028 | 2.73 (1.02–7.29) 0.046 | 2.16 (0.66–7.04) 0.203 | 0.0011/0.0020 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs117234242 | CARS (p.G342S) | C | T | 0.0099/0.0049 | 2.08 (1.08–4.03) 0.029 | 2.70 (1.29–5.64) 0.008 | 0.87 (0.28–2.75) 0.813 | 0.0115/0.0087 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs183343406 | FAT1 (p.P1614L) | G | A | 0.0079/0.0034 | 2.36 (1.09–5.13) 0.030 | 2.25 (0.9–5.62) 0.084 | 1.83 (0.64–5.28) 0.261 | 0.0011/0.0033 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs121912691 | SLC3A1 (p.M467T) | T | C | 0.0044/0.0011 | 4.00 (1.14–14.11) 0.031 | 2.50 (0.55–11.37) 0.236 | 5.75 (1.41–23.43) 0.015 | 0.0016/0.0041 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs41313325 | COL15A1 (p.I1304M) | A | G | 0.0051/0.0015 | 3.31 (1.09–10.01) 0.035 | 3.13 (0.87–11.31) 0.081 | 3.11 (0.86–11.23) 0.083 | 0.0011/0.0035 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs139367894 | EYA2 (p.G247R) | G | A | 0.0038/0.0008 | 4.99 (1.10–22.58) 0.037 | 4.14 (0.79–21.82) 0.093 | 5.17 (0.92–29.10) 0.063 | 0.0005/0.0006 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs114985471 | XXYLT1 (p.P391L) | G | A | 0.0048/0.0015 | 3.09 (1.01–9.44) 0.048 | 2.23 (0.55–9.02) 0.259 | 2.95 (0.77–11.32) 0.115 | 0.0022/0.0056 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs1046806 | ARRDC2 (p.R137Q) | G | A | 0.0031/0.0004 | 8.10 (1.02–64.21) 0.048 | 12.21 (1.45–102.7) 0.021 | 6.44 (0.54–76.17) 0.140 | 0.0033/0.0036 |
Newly identified (EAF_TCGA ≤ EAF_gnomAD) | rs138043992 | KIF15 (p.R501L) | G | T | 0.0045/0.0015 | 3.09 (1.00–9.54) 0.049 | 2.75 (0.73–10.41) 0.137 | 3.94 (1.0–15.49) 0.049 | 0.0027/0.0035 |
a EAF of case and control in the EAGLE study.
b P-values were generated by logistic model, adjusted for sex, age and ancestry (the first 10 principal components).
Public cases are LC patients of European ancestry from TCGA; public controls are non-cancer non-Finnish European populations from gnomAD. LUAD: lung adenocarcinoma. Previously reported candidates with P-value <0.05 and newly identified candidate variants with P-value <0.01 are in bold.
Furthermore, we observed associations with two novel missense RDVs: rs77187983 (p.D590V, EHBP1, OR = 3.13, 95% CI: 1.34–7.30, P = 0.008) and rs200132735 (p.G523E, SEC16B, OR = 4.46, 95% CI: 1.51–13.17, P = 0.007). These results were not corrected for multiple testing. However, by comparing the effect allele frequencies (EAFs) of these RDVs between 914 LC cases of European descent from The Cancer Genome Atlas (TCGA) and 114 704 non-cancer non-Finnish European-ancestry individuals from the Genome Aggregation Database (gnomAD), we found that BRCA2 rs11571833 (EAF = 0.0208 versus 0.0084), MPZL2 rs752672077 (EAF = 0.0027 versus 0.0012) and EHBP1 rs77187983 (EAF = 0.0093 versus 0.0061) were enriched in LC cases compared with general population controls. In contrast, SEC16B rs200132735 was depleted in the LC cases from TCGA. Furthermore, rs11571833 and rs77187983 were suggestively associated with LC risk in 1681 LC cases versus 173 993 controls from the Finnish population (P = 0.002) and in 2007 LC cases versus 359 187 controls from the British population (P = 0.028), on the basis of the FinnGen (https://r5.finngen.fi/) and UK Biobank (https://www.ukbiobank.ac.uk/) studies, respectively.
Gene-based association test
In gene-based analyses, 12 genes were identified as LC risk candidates with P-value < 0.05 (Table 2). Among them, we found a gene-level association of BRCA2 with overall LC risk (P = 0.007) and discovered a novel candidate susceptibility gene, TRIB3 (P = 0.009), whose association was primarily driven by lung squamous cell carcinoma (LUSC) (P = 0.0003). Moreover, ATM, a previously identified LC susceptibility factor that can regulate the DNA repair network upon double-strand breaks, was also suggestively associated (P = 0.014) with LC risk. The association of BRCA2 was primarily driven by the single variant rs11571833, whereas the associations of TRIB3 and ATM reflected the cumulative effect of multiple RDVs with EAF <0.5% (Fig. 1). Although these gene-based associations were not significant after multiple testing correction, they were also observed in the UK Biobank 300K WES project with varying degrees of evidence, all with suggestive P-values < 0.05 (Supplementary Material, Table S1), indicating that they could be truly associated genes, which warrant further functional investigation.
Table 2.
Gene-based association tests in the EAGLE study by SKAT-O
Category | Gene | Frequency (%), Cases/Controls | P EAGLE, Overall LC | P EAGLE, LUAD | P EAGLE, LUSC | P UK Biobank, Overall LC |
---|---|---|---|---|---|---|
Previously reported | BRCA2 | 3.4/1.8 | 0.007 | 0.146 | 0.324 | 0.001 |
Previously reported | ATM | 2.9/2.4 | 0.014 | 0.063 | 0.257 | 0.002 |
Newly identified | TRIB3 | 1.2/0.9 | 0.009 | 0.450 | 0.0003 | 0.041 |
Newly identified | BLK | 2.2/1.0 | 0.013 | 0.122 | 0.008 | 0.032 |
Newly identified | TMEM67 | 1.3/0.4 | 0.017 | 0.009 | 0.008 | 0.020 |
Newly identified | NTRK1 | 2.0/0.8 | 0.020 | 0.115 | 0.005 | 0.087 |
Newly identified | EHBP1 | 2.7/1.8 | 0.020 | 0.014 | 0.046 | 0.083 |
Newly identified | INSRR | 1.4/0.5 | 0.023 | 0.321 | 0.003 | 0.001 |
Newly identified | LMO7 | 2.3/1.4 | 0.028 | 0.161 | 0.005 | 0.013 |
Newly identified | TTN | 5.1/3.3 | 0.030 | 0.048 | 0.134 | 0.080 |
Newly identified | FLG | 1.6/0.7 | 0.034 | 0.614 | 0.005 | 0.007 |
Newly identified | LMF1 | 0.8/0.5 | 0.040 | 0.003 | 0.640 | 0.004 |
Only P-value of the most significant associations with overall LC risk from the UK Biobank are provided for each candidate gene as supporting evidence. The full detailed information (e.g. UK Biobank phenotype code, collapsing models applied, OR) is listed in Supplementary Material, Table S1. Previously reported candidates with P-value <0.05 and newly identified candidate genes with P-value <0.01 are in bold.
Figure 1.
Distribution of RDVs in BRCA2, ATM and TRIB3 in LC cases and controls from the EAGLE study. Rare variants with P-value <0.01 are labeled in figure.
Pathway-based association test
In pathway-based analyses, we found suggestive evidence for overall enrichment of RDVs in the homologous recombination repair pathway in cases versus controls (carrier frequency [CF] = 22.9% versus 19.5%, P = 0.017), supporting a potential role for DNA repair in genetic predisposition to LC (Table 3). The Fanconi anemia pathway including genes involved in the recognition or repair of DNA inter-strand crosslinks also emerged as a potential risk-conferring candidate (CF = 12.5% versus 10.2%, P = 0.036), in agreement with previous findings (15). After excluding BRCA2 from the gene sets, RDVs of homologous recombination repair (CF = 20.05% versus 18.09%, P = 0.27) and Fanconi anemia (CF = 9.65% versus 8.66%, P = 0.33) pathways were still enriched in cases compared with controls, although they were not statistically significant any longer, as seen in studies of other cancer types (16).
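The carrier frequencies compared above can be illustrated with a minimal Python sketch. It assumes that a carrier is an individual with at least one RDV in any gene of the pathway, which the text does not define explicitly. The function name, the data layout and the sample identifiers are hypothetical, and the toy data are not EAGLE data.

```python
from typing import Dict, Iterable, Set


def carrier_frequency(
    rdv_genes_by_sample: Dict[str, Set[str]],
    samples: Iterable[str],
    pathway_genes: Set[str],
) -> float:
    """Percentage of samples carrying at least one RDV in any pathway gene."""
    sample_list = list(samples)
    if not sample_list:
        raise ValueError("samples must not be empty")
    carriers = sum(
        1 for s in sample_list
        if rdv_genes_by_sample.get(s, set()) & pathway_genes
    )
    return 100.0 * carriers / len(sample_list)


# Toy data for illustration only (hypothetical samples and gene set).
toy_pathway = {"BRCA2", "ATM"}
toy_rdv_genes = {
    "case1": {"BRCA2"},
    "case2": {"TTN"},
    "case3": {"ATM", "FLG"},
    "ctrl1": {"ATM"},
}
toy_cases = ["case1", "case2", "case3", "case4"]
toy_controls = ["ctrl1", "ctrl2", "ctrl3", "ctrl4"]

cf_cases = carrier_frequency(toy_rdv_genes, toy_cases, toy_pathway)
cf_controls = carrier_frequency(toy_rdv_genes, toy_controls, toy_pathway)

# Sensitivity analysis in the spirit of the text: drop BRCA2 from the gene set.
cf_cases_no_brca2 = carrier_frequency(
    toy_rdv_genes, toy_cases, toy_pathway - {"BRCA2"}
)
```

Removing a gene from the set can only lower or preserve the carrier frequency, which is consistent with the smaller CF values reported after excluding BRCA2.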
Table 3.
Pathway-based association tests in the EAGLE study by SKAT-O
Pathway | Frequency (%), Cases/Controls | P-value, Overall LC | P-value, LUAD | P-value, LUSC |
---|---|---|---|---|
Homologous recombination | 22.9/19.5 | 0.017 | 0.167 | 0.158 |
Fanconi anemia | 12.5/10.2 | 0.036 | 0.068 | 0.550 |
Tumor Necrosis Factor (TNF) signaling | 6.6/5.5 | 0.065 | 0.527 | 0.215 |
Ataxia Telangiectasia Mutated (ATM) signaling | 11.6/10.9 | 0.114 | 0.468 | 0.216 |
DNA damage/telomere stress-induced senescence | 10.0/8.8 | 0.123 | 0.289 | 0.459 |
Base excision repair | 12.6/14.4 | 0.149 | 0.457 | 0.110 |
Phosphoinositide 3-kinases (PI3K)/AKT serine/threonine kinase (AKT) signaling | 12.9/15.1 | 0.303 | 0.103 | 0.809 |
Mechanistic Target Of Rapamycin Kinase (mTOR) pathway | 5.1/6.4 | 0.305 | 0.432 | 0.771 |
Cell cycle | 17.0/17.9 | 0.355 | 0.415 | 0.762 |
Vitamin A and carotenoid metabolism | 14.0/16.6 | 0.368 | 0.705 | 0.278 |
Apoptosis | 9.4/10.0 | 0.408 | 0.405 | 0.311 |
Mismatch repair | 7.4/8.6 | 0.484 | 0.495 | 0.131 |
DNA damage response | 10.6/10.0 | 0.486 | 0.506 | 0.855 |
Non-homologous DNA end joining | 5.7/6.8 | 0.491 | 0.636 | 0.460 |
Cytokines and inflammatory response | 0.68/0.61 | 0.504 | 0.260 | 0.551 |
Nucleotide excision repair | 8.0/7.3 | 0.558 | 0.872 | 0.880 |
Notch Receptor 1 signaling pathway | 8.8/9.6 | 0.626 | 0.311 | 0.497 |
Nicotine degradation | 2.3/2.7 | 0.673 | 0.353 | 0.498 |
Signaling by Epidermal Growth Factor Receptor (EGFR) in cancer | 2.2/2.4 | 0.753 | 0.878 | 0.782 |
Vitamin D metabolism | 3.4/3.3 | 0.771 | 0.756 | 0.663 |
Vascular Endothelial Growth Factor (VEGF) signaling | 11.2/11.5 | 0.783 | 0.939 | 0.704 |
Previously reported candidates with P-value <0.05 and newly identified candidate pathways with P-value <0.01 are in bold.
Discussion
Previous studies exploring the association between rare variants and LC risk were based on limited numbers of subjects from different populations of European descent, without additionally considering population structure (8–11). In the present study, through large-scale WES analysis in a homogeneous population with verified absence of population stratification, we not only replicated known RDVs, susceptibility genes and pathways, but also identified potential novel candidates for LC risk, which may help to improve our knowledge of LC biology and heritability.
The RDVs newly identified by single-variant tests are located in two genes (EHBP1 and SEC16B), both involved in vesicular trafficking (17,18), although only p.D590V in EHBP1, which conveyed a three-fold increased LC risk, was confirmed to be enriched in TCGA LC cases versus gnomAD controls. Interestingly, EHBP1 was also reported to be associated with aggressive prostate cancer risk by a previous large-scale GWAS analysis of over 38 500 participants of European ancestry (19), indicating its potentially important susceptibility function across multiple cancer types. Further experimental exploration of the role of this predisposing candidate in LC risk is warranted.
Through gene-based analyses, TRIB3 was identified as a novel LC susceptibility candidate gene. As a guardian of genome integrity, TRIB3 has been shown to respond to several cellular stresses and participate in maintaining genome stability and DNA repair (20,21). Although it is not a core member of the DNA repair system, TRIB3 is able to interact with a wide variety of DNA repair genes, including the CtIP-Rb-BRCA1-ATM network (20). Within the same network, we confirmed BRCA2 and ATM as important susceptibility genes, whereas at the pathway level, homologous recombination and Fanconi anemia repair pathways were enriched in LC cases versus controls. Collectively, these findings highlight the importance of DNA repair in LC susceptibility and confirm the value of using ancestry-distinct populations for the identification of rare susceptibility variants.
Although EAGLE is the largest single WES study of LC from a homogeneous population, associations at the whole-exome level are only suggestive after multiple testing adjustment, as in previous studies (10). To address this limitation, we systematically investigated several public databases (LC cases in TCGA versus population controls in gnomAD, and LC cases and controls in the UK Biobank and FinnGen). We observed an enrichment of these newly identified RDVs and genes in LC cases, which supports these associations and suggests that they can extend beyond the Italian population. Our results advance the understanding of LC genetic susceptibility and may serve as a foundation for personalized prevention and possibly treatment of this lethal disease.
Materials and Methods
Study populations
Participants of the EAGLE study were enrolled from 216 municipalities in the Lombardy region, the most populated area of Italy with almost 10 million inhabitants, covering five specific cities (Milan, Monza, Brescia, Pavia and Varese) and surrounding towns and villages. Informed consent was obtained from all subjects before study initiation. The corresponding protocol was approved by the Institutional Review Board of the US National Cancer Institute (NCI) and the involved hospitals and universities in Italy. The current high-throughput sequencing study included 1461 verified, incident, primary LC cases and 1316 population-based healthy controls, all of whom had adequate amounts of good-quality genomic DNA as well as matched whole-genome single nucleotide polymorphism (SNP) genotyping data for assessing population structure. LC cases and controls were frequency-matched by residence, gender and age (Supplementary Material, Table S2). No population stratification was observed by principal component analysis on the basis of paired GWAS data (Supplementary Material, Fig. S1).
WES and data analysis
WES was performed at the Cancer Genomics Research (CGR) Laboratory in the NCI as previously described (22). A total of 1.1 μg of captured genomic DNA from each sample underwent paired-end high-throughput sequencing on an Illumina HiSeq platform (Illumina, San Diego, CA), with an average sequencing depth of 53.7×. Raw reads in FASTQ format were filtered and adapter-trimmed by Trimmomatic v0.32 (23), and subsequently aligned to the human reference genome hg19 by the Novoalign package (v3.00.05, http://www.novocraft.com). Alignments were further processed for quality control by excluding duplicate reads with the Picard package. Afterward, the RealignerTargetCreator and IndelRealigner modules from the Genome Analysis Toolkit (GATK v3.1) were used for local realignment around insertion and deletion sites (24).
Three variant callers were used for germline variant discovery and genotype calling: the UnifiedGenotyper and HaplotypeCaller modules from GATK and the FreeBayes variant caller (v9.9.2) (25). Results from the different callers were consolidated by an ensemble variant calling pipeline (v0.2.2, http://bcb.io/2013/02/06/an-automated-ensemble-method-for-combining-and-evaluating-genomic-variants-from-multiple-callers/). A Support Vector Machine learning algorithm was further applied to identify an optimal decision boundary and produce a more balanced decision between false and true positives. In order to select variants with high confidence, those meeting any of the following criteria were excluded: (i) variants with total read depth < 10; (ii) variants that failed to pass the CGR pipeline quality control metric (‘CScorefilter’); (iii) heterozygous variants with high or low allele balance (ABHet < 0.2 or >0.8); (iv) variants identified in the gnomAD database that failed to pass the corresponding gnomAD quality control filters (‘AC0’, ‘RF’ or ‘InbreedingCoeff’) (26,27).
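The four exclusion criteria above can be expressed as a single predicate. The following Python sketch is illustrative only: the `Variant` record and its field names are hypothetical and do not reflect the actual CGR pipeline, and only the thresholds and filter names are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# gnomAD filter flags named in the text; carrying any of them excludes a variant.
GNOMAD_FAIL_FLAGS = {"AC0", "RF", "InbreedingCoeff"}


@dataclass
class Variant:
    # All field names are hypothetical, chosen for this sketch.
    depth: int                      # total read depth
    passes_cscore: bool             # True if the 'CScorefilter' metric was passed
    is_het: bool                    # True for a heterozygous call
    ab_het: Optional[float] = None  # allele balance of a heterozygous call
    gnomad_filters: Set[str] = field(default_factory=set)


def passes_qc(v: Variant) -> bool:
    """Return True if the variant meets none of the four exclusion criteria."""
    if v.depth < 10:                                   # criterion (i)
        return False
    if not v.passes_cscore:                            # criterion (ii)
        return False
    if v.is_het and v.ab_het is not None and (v.ab_het < 0.2 or v.ab_het > 0.8):
        return False                                   # criterion (iii)
    if v.gnomad_filters & GNOMAD_FAIL_FLAGS:           # criterion (iv)
        return False
    return True


kept = passes_qc(Variant(depth=30, passes_cscore=True, is_het=True, ab_het=0.45))
```

A variant absent from gnomAD has an empty filter set here and therefore cannot fail criterion (iv), which matches the wording that only variants identified in gnomAD are subject to that filter.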
Variant annotation was performed using a custom CGR in-house script that integrates ANNOVAR (28), SnpEff (29) and SnpSift (30) on the basis of several public resources, including refGene, Ensembl, UCSC KnownGene database, the dataset from University of Washington’s Exome Sequencing Project (ESP6500) (http://evs.gs.washington.edu/EVS/), Human nonsynonymous SNPs and function predictions database (dbNSFP v41, https://sites.google.com/site/jpopgen/dbNSFP), Single Nucleotide Polymorphism database (dbSNP build 137, https://www.ncbi.nlm.nih.gov/snp/), the 1000 Genomes Project (https://www.internationalgenome.org/) and the Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk/ac/index.php).
Identification of RDVs
Germline variants were considered rare if they met the following two criteria, which were used by previous LC WES projects (9,11): (i) EAF ≤ 2% in both LC cases and controls from the current study; (ii) EAF ≤ 1% in the non-cancer, non-Finnish European-ancestry populations from gnomAD v2.1.1. Rare variants were further classified as RDVs if they met one of the following criteria: (i) they were annotated as frameshift or stop-gain variants; (ii) they were annotated as nonsynonymous variants and predicted to be deleterious by at least five out of seven variant functional impact prediction tools, including Sorting Intolerant from Tolerant (D), MutationAssessor (H), MutationTaster (D), PolyPhen2 (D), Likelihood Ratio Test (D), Functional Analysis Through Hidden Markov Models (D) and PROVEAN (D).
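The rarity and deleteriousness rules above can be sketched in Python as follows. The function names, the consequence labels and the short tool names used as dictionary keys are hypothetical. Treating a variant that is absent from gnomAD as passing the gnomAD frequency criterion is an assumption of this sketch and is not stated in the text.

```python
from typing import Dict, Optional

# Hypothetical consequence labels for the truncating classes named in the text.
TRUNCATING = {"frameshift", "stopgain"}

# Prediction code counted as deleterious for each of the seven tools
# (codes from the text; the short tool names are this sketch's own keys).
DELETERIOUS_CODE = {
    "SIFT": "D",
    "MutationAssessor": "H",
    "MutationTaster": "D",
    "PolyPhen2": "D",
    "LRT": "D",
    "FATHMM": "D",
    "PROVEAN": "D",
}


def is_rare(eaf_cases: float, eaf_controls: float,
            eaf_gnomad: Optional[float]) -> bool:
    """EAF <= 2% in both study groups and <= 1% in gnomAD."""
    in_study = eaf_cases <= 0.02 and eaf_controls <= 0.02
    # Assumption: a variant not found in gnomAD (None) passes this criterion.
    in_gnomad = eaf_gnomad is None or eaf_gnomad <= 0.01
    return in_study and in_gnomad


def is_deleterious(consequence: str, predictions: Dict[str, str]) -> bool:
    """Truncating, or nonsynonymous with at least five of seven deleterious calls."""
    if consequence in TRUNCATING:
        return True
    if consequence != "nonsynonymous":
        return False
    votes = sum(
        1 for tool, code in DELETERIOUS_CODE.items()
        if predictions.get(tool) == code
    )
    return votes >= 5


def is_rdv(eaf_cases: float, eaf_controls: float, eaf_gnomad: Optional[float],
           consequence: str, predictions: Dict[str, str]) -> bool:
    return (is_rare(eaf_cases, eaf_controls, eaf_gnomad)
            and is_deleterious(consequence, predictions))


# rs11571833 (BRCA2 p.K3326X), frequencies from Table 1; stop-gain variant.
brca2_is_rdv = is_rdv(0.0144, 0.0068, 0.0084, "stopgain", {})
```

With the Table 1 frequencies, the BRCA2 stop-gain variant satisfies both rarity thresholds and qualifies without any tool predictions, as expected for a truncating variant.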
Association tests
Single-variant association testing for identified RDVs was performed with a logistic regression model implemented in PLINK v1.9.0 (31), adjusting for age, gender and ancestry (the top 10 principal components derived from the matched SNP array data of our previous GWAS) (32). To reduce the false positive rate, only genes with at least five RDVs in both cases and controls were retained for gene-based association analysis using the Optimal Sequence Kernel Association Test (SKAT-O, v2.0.1), with the same adjustments as for the single-variant analysis (33). To perform pathway-based analysis, we focused on 21 candidate pathways that have been previously reported to be related to LC on the basis of literature curation, including five DNA repair-related pathways (34). Their core members were obtained from the Human Pathway Unification Database (PathCards, https://pathcards.genecards.org/) (35) and QIAGEN Ingenuity Pathway Analysis (IPA) (36), supplemented by the literature (Supplementary Material, Table S3). After binning RDVs at the pathway level, association analysis was performed using the same algorithm as in the gene-based analysis. RDV, gene and pathway candidates were regarded as potential risk factors if their burden was significantly more enriched in LC cases than in controls with a suggestive P-value of <0.05 (11,15). Newly identified candidates with a stricter P-value of <0.01 for the association with LC risk were highlighted in this study. All statistical tests were two-sided.
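The relationship between a logistic regression coefficient and the reported OR, 95% CI and two-sided P-value can be checked with a short Python sketch. It assumes Wald-type intervals and a normal approximation for the test statistic, which is an assumption about how the reported values were derived; the function names are hypothetical. The worked input is the BRCA2 rs11571833 estimate from Table 1.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta: float, se: float, z: float = Z_95):
    """Convert a log-odds coefficient and its standard error to OR and Wald CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def wald_p_from_or_ci(odds_ratio: float, lower: float, upper: float,
                      z: float = Z_95) -> float:
    """Recover the two-sided Wald P-value implied by an OR and its 95% CI."""
    # The CI is symmetric on the log scale, so its width gives the standard error.
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    z_stat = math.log(odds_ratio) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z_stat) / math.sqrt(2))


# rs11571833 (BRCA2): OR = 2.18, 95% CI 1.25-3.81 (Table 1).
p_brca2 = wald_p_from_or_ci(2.18, 1.25, 3.81)
print(f"implied two-sided P = {p_brca2:.3f}")
```

For this variant the implied P-value is close to the reported P = 0.006. Small discrepancies for other rows are expected because the published OR and CI values are rounded.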
Validation resources
In order to verify the identified associations, we investigated our major findings in the following publicly available human biobank resources: (i) TCGA (https://portal.gdc.cancer.gov/) (37), which provides a germline WES dataset from 914 LC patients of European ancestry; (ii) gnomAD (https://gnomad.broadinstitute.org/) (38), which provides detailed information on a comprehensive set of human genetic variants from 114 704 non-cancer individuals of non-Finnish European ancestry; (iii) the FinnGen project (https://r5.finngen.fi/) (39), which provides summary statistics of genetic variant association analyses from 218 792 Finnish individuals; (iv) the UK Biobank 300K WES project (https://azphewas.com) (40), which provides summary statistics of rare variant association analyses from 281 104 exomes of European ancestry.
Supplementary Material
Acknowledgements
This work utilized the computational resources of the NIH high-performance computational capabilities Biowulf cluster (http://hpc.nih.gov). We are thankful to the patients and families who contributed to this study and the researchers who are involved in the EAGLE study (https://eagle.cancer.gov/).
Conflict of Interest statement. None declared.
Contributor Information
Jian Sang, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Tongwu Zhang, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Jung Kim, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Mengying Li, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Angela C Pesatori, Department of Clinical Sciences and Community Health, University of Milan, Milan 20122, Italy.
Dario Consonni, Epidemiology Unit, Fondazione IRCCS Ca’ Granda—Ospedale Maggiore Policlinico, Milan 20122, Italy.
Lei Song, Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD 21701, USA.
Jia Liu, Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Frederick, MD 21701, USA.
Wei Zhao, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Phuc H Hoang, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Dave S Campbell, Information Management Services, Rockville, MD 20850, USA.
James Feng, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Monica E D’Arcy, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Naoise Synnott, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Yingxi Chen, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Zeni Wu, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Bin Zhu, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Xiaohong R Yang, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Kevin M Brown, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Jiyeon Choi, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Jianxin Shi, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Maria Teresa Landi, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Funding
Intramural research funding, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health.
Data Availability
The sequencing data that support the findings of this study are deposited at the dbGaP (https://www.ncbi.nlm.nih.gov/gap/) under accession no. phs002496.v1.p1.
References
- 1. Siegel, R.L., Miller, K.D., Fuchs, H.E. and Jemal, A. (2021) Cancer statistics, 2021. CA Cancer J. Clin., 71, 7–33. [DOI] [PubMed] [Google Scholar]
- 2. Islami, F., Goding Sauer, A., Miller, K.D., Siegel, R.L., Fedewa, S.A., Jacobs, E.J., McCullough, M.L., Patel, A.V., Ma, J., Soerjomataram, I. et al. (2018) Proportion and number of cancer cases and deaths attributable to potentially modifiable risk factors in the United States. CA Cancer J. Clin., 68, 31–54.
- 3. McKay, J.D., Hung, R.J., Han, Y., Zong, X., Carreras-Torres, R., Christiani, D.C., Caporaso, N.E., Johansson, M., Xiao, X., Li, Y. et al. (2017) Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nat. Genet., 49, 1126–1132.
- 4. Catalano, C., Paramasivam, N., Blocka, J., Giangiobbe, S., Huhn, S., Schlesner, M., Weinhold, N., Sijmons, R., de Jong, M., Langer, C. et al. (2021) Characterization of rare germline variants in familial multiple myeloma. Blood Cancer J., 11, 33.
- 5. Goldstein, A.M., Xiao, Y., Sampson, J., Zhu, B., Rotunno, M., Bennett, H., Wen, Y., Jones, K., Vogt, A., Burdette, L. et al. (2017) Rare germline variants in known melanoma susceptibility genes in familial melanoma. Hum. Mol. Genet., 26, 4886–4895.
- 6. Karlsson, Q., Brook, M.N., Dadaev, T., Wakerell, S., Saunders, E.J., Muir, K., Neal, D.E., Giles, G.G., MacInnis, R.J., Thibodeau, S.N. et al. (2021) Rare germline variants in ATM predispose to prostate cancer: a PRACTICAL Consortium Study. Eur. Urol. Oncol., 4, 570–579.
- 7. Yepes, S., Shah, N.N., Bai, J., Koka, H., Li, C., Gui, S., McMaster, M.L., Xiao, Y., Jones, K., Wang, M. et al. (2021) Rare germline variants in chordoma-related genes and chordoma susceptibility. Cancers (Basel), 13, 2704.
- 8. Liu, Y., Kheradmand, F., Davis, C.F., Scheurer, M.E., Wheeler, D., Tsavachidis, S., Armstrong, G., Simpson, C., Mandal, D., Kupert, E. et al. (2016) Focused analysis of exome sequencing data for rare germline mutations in familial and sporadic lung cancer. J. Thorac. Oncol., 11, 52–61.
- 9. Liu, Y.H., Lusk, C.M., Cho, M.H., Silverman, E.K., Qiao, D.D., Zhang, R.Y., Scheurer, M.E., Kheradmand, F., Wheeler, D.A., Tsavachidis, S. et al. (2018) Rare variants in known susceptibility loci and their contribution to risk of lung cancer. J. Thorac. Oncol., 13, 1483–1495.
- 10. Liu, Y.H., Xia, J., McKay, J., Tsavachidis, S., Xiao, X.J., Spitz, M.R., Cheng, C., Byun, J., Hong, W., Li, Y.F. et al. (2021) Rare deleterious germline variants and risk of lung cancer. Npj Precis. Oncol., 5, 12.
- 11. Selvan, M.E., Zauderer, M.G., Rudin, C.M., Jones, S., Mukherjee, S., Offit, K., Onel, K., Rennert, G., Velculescu, V.E., Lipkin, S.M. et al. (2020) Inherited rare, deleterious variants in ATM increase lung adenocarcinoma risk. J. Thorac. Oncol., 15, 1871–1879.
- 12. Landi, M.T., Consonni, D., Rotunno, M., Bergen, A.W., Goldstein, A.M., Lubin, J.H., Goldin, L., Alavanja, M., Morgan, G., Subar, A.F. et al. (2008) Environment and genetics in lung cancer etiology (EAGLE) study: an integrative population-based case-control study of lung cancer. BMC Public Health, 8, 203.
- 13. Wang, Y.F., McKay, J.D., Rafnar, T., Wang, Z.M., Timofeeva, M.N., Broderick, P., Zong, X.C., Laplana, M., Wei, Y.Y., Han, Y.H. et al. (2014) Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat. Genet., 46, 736–741.
- 14. Rafnar, T., Sigurjonsdottir, G.R., Stacey, S.N., Halldorsson, G., Sulem, P., Pardo, L.M., Helgason, H., Sigurdsson, S.T., Gudjonsson, T., Tryggvadottir, L. et al. (2018) Association of BRCA2 K3326* with small cell lung cancer and squamous cell cancer of the skin. J. Natl. Cancer Inst., 110, 967–974.
- 15. Selvan, M.E., Klein, R.J. and Gumus, Z.H. (2019) Rare, pathogenic germline variants in Fanconi anemia genes increase risk for squamous lung cancer. Clin. Cancer Res., 25, 1517–1525.
- 16. Darst, B.F., Dadaev, T., Saunders, E., Sheng, X., Wan, P., Pooler, L., Xia, L.Y., Chanock, S., Berndt, S.I., Gapstur, S.M. et al. (2021) Germline sequencing DNA repair genes in 5545 men with aggressive and nonaggressive prostate cancer. J. Natl. Cancer Inst., 113, 616–625.
- 17. Naslavsky, N. and Caplan, S. (2005) C-terminal EH-domain-containing proteins: consensus for a role in endocytic trafficking, EH? J. Cell Sci., 118, 4093–4101.
- 18. Yonekawa, S., Furuno, A., Baba, T., Fujiki, Y., Ogasawara, Y., Yamamoto, A., Tagaya, M. and Tani, K. (2011) Sec16B is involved in the endoplasmic reticulum export of the peroxisomal membrane biogenesis factor peroxin 16 (Pex16) in mammalian cells. Proc. Natl. Acad. Sci. USA, 108, 12746–12751.
- 19. Gudmundsson, J., Sulem, P., Rafnar, T., Bergthorsson, J.T., Manolescu, A., Gudbjartsson, D., Agnarsson, B.A., Sigurdsson, A., Benediktsdottir, K.R., Blondal, T. et al. (2008) Common sequence variants on 2p15 and Xp11.22 confer susceptibility to prostate cancer. Nat. Genet., 40, 281–283.
- 20. Aynaud, M.M., Suspene, R., Vidalain, P.O., Mussil, B., Guetard, D., Tangy, F., Wain-Hobson, S. and Vartanian, J.P. (2012) Human tribbles 3 protects nuclear DNA from cytidine deamination by APOBEC3A. J. Biol. Chem., 287, 39182–39192.
- 21. Yokoyama, T. and Nakamura, T. (2011) Tribbles in disease: signaling pathways important for cellular function and neoplastic transformation. Cancer Sci., 102, 1115–1122.
- 22. Shi, J., Yang, X.R., Ballew, B., Rotunno, M., Calista, D., Fargnoli, M.C., Ghiorzo, P., Bressac-de Paillerets, B., Nagore, E., Avril, M.F. et al. (2014) Rare missense variants in POT1 predispose to familial cutaneous malignant melanoma. Nat. Genet., 46, 482–486.
- 23. Bolger, A.M., Lohse, M. and Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.
- 24. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M. et al. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
- 25. Yang, X.R., Rotunno, M., Xiao, Y., Ingvar, C., Helgadottir, H., Pastorino, L., van Doorn, R., Bennett, H., Graham, C., Sampson, J.N. et al. (2016) Multiple rare variants in high-risk pancreatic cancer-related genes may increase risk for pancreatic cancer in a subset of patients with and without germline CDKN2A mutations. Hum. Genet., 135, 1241–1249.
- 26. Subramanian, D.N., Zethoven, M., McInerny, S., Morgan, J.A., Rowley, S.M., Lee, J.E.A., Li, N., Gorringe, K.L., James, P.A. and Campbell, I.G. (2020) Exome sequencing of familial high-grade serous ovarian carcinoma reveals heterogeneity for rare candidate susceptibility genes. Nat. Commun., 11, 1640.
- 27. Yepes, S., Tucker, M.A., Koka, H., Xiao, Y., Jones, K., Vogt, A., Burdette, L., Luo, W., Zhu, B., Hutchinson, A. et al. (2020) Using whole-exome sequencing and protein interaction networks to prioritize candidate genes for germline cutaneous melanoma susceptibility. Sci. Rep., 10, 17198.
- 28. Wang, K., Li, M. and Hakonarson, H. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res., 38, e164.
- 29. Cingolani, P., Platts, A., Wang, L.L., Coon, M., Nguyen, T., Wang, L., Land, S.J., Lu, X. and Ruden, D.M. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin), 6, 80–92.
- 30. Cingolani, P., Patel, V.M., Coon, M., Nguyen, T., Land, S.J., Ruden, D.M. and Lu, X. (2012) Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front. Genet., 3, 35.
- 31. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet., 81, 559–575.
- 32. Landi, M.T., Chatterjee, N., Yu, K., Goldin, L.R., Goldstein, A.M., Rotunno, M., Mirabello, L., Jacobs, K., Wheeler, W., Yeager, M. et al. (2009) A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am. J. Hum. Genet., 85, 679–691.
- 33. Lee, S., Emond, M.J., Bamshad, M.J., Barnes, K.C., Rieder, M.J., Nickerson, D.A., NHLBI GO Exome Sequencing Project-ESP Lung Project Team, Christiani, D.C., Wurfel, M.M. and Lin, X. (2012) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet., 91, 224–237.
- 34. Chatterjee, N. and Walker, G.C. (2017) Mechanisms of DNA damage, repair, and mutagenesis. Environ. Mol. Mutagen., 58, 235–263.
- 35. Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M. and Lancet, D. (2015) PathCards: multi-source consolidation of human biological pathways. Database, 2015, bav006. 10.1093/database/bav006.
- 36. Kramer, A., Green, J., Pollard, J., Jr. and Tugendreich, S. (2014) Causal analysis approaches in ingenuity pathway analysis. Bioinformatics, 30, 523–530.
- 37. Tomczak, K., Czerwinska, P. and Wiznerowicz, M. (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. (Pozn), 19, A68–A77.
- 38. Koch, L. (2020) Exploring human genomic diversity with gnomAD. Nat. Rev. Genet., 21, 448.
- 39. Locke, A.E., Steinberg, K.M., Chiang, C.W.K., Service, S.K., Havulinna, A.S., Stell, L., Pirinen, M., Abel, H.J., Chiang, C.C., Fulton, R.S. et al. (2019) Exome sequencing of Finnish isolates enhances rare-variant association power. Nature, 572, 323–328.
- 40. Wang, Q., Dhindsa, R.S., Carss, K., Harper, A.R., Nag, A., Tachmazidou, I., Vitsios, D., Deevi, S.V.V., Mackay, A., Muthas, D. et al. (2021) Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature, 597, 527–532.
Associated Data
Data Availability Statement
The sequencing data that support the findings of this study are deposited in dbGaP (https://www.ncbi.nlm.nih.gov/gap/) under accession no. phs002496.v1.p1.