Abstract
Introduction
Genome-wide association studies (GWAS) have consistently identified specific lung cancer susceptibility regions. We evaluated the lung cancer predictive performance of single nucleotide polymorphisms (SNPs) in these regions.
Methods
Lung cancer cases (N=778) and controls (N=1166) were genotyped for 77 SNPs located in GWAS-identified lung cancer susceptibility regions. Variable selection and model development used stepwise logistic regression and decision-tree analysis. In a subset nested in the Pittsburgh Lung Screening Study, change in area under the receiver operator characteristic curve (AUC) and net reclassification improvement (NRI) were used to compare predictions made by risk factor models with and without genetic variables.
Results
Variable selection and model development kept two SNPs in each of three GWAS regions, rs2736100 and rs7727912 in 5p15.33, rs805297 and rs1802127 in 6p21.33, and rs8034191 and rs12440014 in 15q25.1. The ratio of cases to controls was three times higher among subjects with a high-risk genotype in every one as opposed to none of the three GWAS regions (odds ratio 3.14, 95% confidence interval 2.02-4.88, adjusted for sex, age and pack-years). Adding a three-level classified count of GWAS regions with high-risk genotypes to an age and smoking risk factor-only model improved lung cancer prediction by a small amount: AUC 0.725 vs. 0.717 (P=0.056); NRIoverall was 0.052 across low, intermediate, and high 6-year lung cancer risk categories (<3.0%, 3.0% to 4.9%, ≥5.0%).
Conclusions
Specifying genotypes for SNPs in three GWAS-identified susceptibility regions improved lung cancer prediction, but probably by an extent too small to affect disease control practice.
Keywords: lung cancer, single nucleotide polymorphism, risk prediction
Introduction
Genome-wide association studies (GWAS) compare the frequency of common inherited variation between two groups of individuals, typically using many hundreds of thousands of single nucleotide polymorphisms (SNPs), to identify regions of the human genome associated with a particular trait or disease. In comparisons of individuals with and without lung cancer, this approach has reproducibly identified several susceptibility regions of interest.[1-5] Although these genomic regions contain multiple potentially relevant genes, we currently lack a firm biological understanding of the causal molecular processes that link them to lung cancer.[6] The regions implicated in GWAS of lung cancer may act indirectly by affecting intensity of cigarette smoking or personal susceptibility to chronic obstructive pulmonary disease (COPD),[7] two potent lung cancer risk factors.[8]
The identification of susceptibility loci has stimulated interest in the use of genetic information to help identify persons at particularly high risk of developing lung cancer.[9] Genetic information might be used to improve the prediction accuracy of currently available models based on cigarette smoking and age.[10-15] Selecting persons according to predictions made by more accurate models may help improve the efficiency and safety of lung cancer screening with low-dose computed tomography (CT),[16] a public health relevant issue in light of the recent Centers for Medicare & Medicaid Services funding decision adding lung cancer screening to Medicare.[17] With these thoughts in mind, we aimed to evaluate the predictive performance of common SNPs located in regions previously identified in GWAS of lung cancer or COPD.[1-5,18-27]
Materials and Methods
Study subjects
Consented subjects originated from two studies approved by the University of Pittsburgh Institutional Review Board (IRB). A clinical study contributed 45-85 year-old ≥10 pack-year current or former cigarette smokers who received diagnostic, staging, or therapeutic surgery after 2001 at the University of Pittsburgh Medical Center for non-carcinoid lung cancer and who provided a research blood sample within 1 year of diagnosis. The second group of subjects came from the Pittsburgh Lung Screening Study (PLuSS), a Pittsburgh-based CT lung cancer screening study that enrolled, between 2002 and 2005, 50-79 year-old current or former cigarette smokers (≥half pack/day for ≥25 years, quit ≤10 years).[28] This second study group included a set of lung cancer cases, and a lung cancer-free simple random sample, enriched with persons found to have one or more benign non-calcified nodules, 5-19 mm in diameter on first CT or growing or new on follow-up CT. Our genotyping studies added persons with nodules to enhance parallel studies of circulating protein-based biomarkers and early lung cancer detection.
These two sources identified 2292 individuals in all, including 2189 (95.5% of 2292) with DNA available for analysis and 2117 (96.7% of 2189) with a DNA sample that passed all genotype quality control checks, including SNP call rate ≥95%. After 173 exclusions (five for missing sex, one for age <50, 10 for <10 pack years, 14 for lung cancer not biopsy confirmed, and 143 for non-white race), 1944 subjects (91.8% of 2117) remained, 778 lung cancer cases (600 clinical and 178 PLuSS) and 1166 lung cancer-free controls (861 selected at random and 305 chosen intentionally for nodule presence). Active lung cancer surveillance after an initial CT screening extended a median 7.3 years (5th to 95th percentile 1.2 to 10.8 years) and a median 10.4 years (5th to 95th percentile 8.8 to 11.7 years) for control subjects who were dead and alive, respectively, at last contact. To reduce confounding related to population stratification, analyses excluded non-white race subjects and used ancestry-informative markers to verify white race.
For clinical cases, ancillary data (sex, age at diagnosis, cigarette pack-years, and spirometry results) were extracted from medical records. For PLuSS cases and controls, ancillary data collected via standardized self-administered questionnaire included smoking intensity (cigarettes/day), years of smoking, family history of lung cancer, respiratory symptoms (cough, phlegm, and wheeze), and medical history (doctor diagnosis of emphysema or bronchitis). As previously described,[29] additional baseline information available for the PLuSS cases and controls included severity of airflow limitation measured by spirometry and severity of emphysema seen on computed tomography.
SNP selection and genotyping
An Illumina® Custom GoldenGate 384 SNP Panel was used to genotype the 778 lung cancer cases and 1166 controls. The panel included 77 SNPs in total (listed in Supplemental Table 1), located in chromosomal regions identified by GWAS for association with lung cancer risk or COPD. Two SNPs failed genotyping (genotype called in <95% of samples) and a third violated Hardy-Weinberg equilibrium (P<0.001). Among the remaining 74 SNPs, 18 were associated with lung cancer risk in our study population at Ptrend < 0.1 (Table 1). Efforts to develop parsimonious models started with these 18 SNPs: two, six, and ten in chromosomal regions 5p15.33, 6p21.33, and 15q25.1, respectively.
Table 1. SNPs associated with lung cancer risk at Ptrend < 0.1a.
| rs number | Chr | Positionb | Gene | Region | Casesc | Controlsc | OR | 95% CI | Ptrend |
|---|---|---|---|---|---|---|---|---|---|
| rs2736100 | 5 | 1286401 | TERT | 5p15.33 | 224/396/158 | 292/618/256 | 0.89 | 0.78-1.02 | 0.091 |
| rs7727912 | 5 | 1318845 | CLPTM1L | 5p15.33 | 611/155/10 | 967/193/6 | 1.32 | 1.07-1.64 | 0.010 |
| rs805297 | 6 | 31654829 | APOM, BAG6 | 6p21.33 | 360/324/84 | 467/482/161 | 0.84 | 0.73-0.96 | 0.010 |
| rs2295663 | 6 | 31701518 | ABHD16A,MIR4646 | 6p21.33 | 672/101/5 | 1043/119/4 | 1.33 | 1.02-1.73 | 0.033 |
| rs805293 | 6 | 31720741 | LY6G6C | 6p21.33 | 225/391/162 | 296/579/291 | 0.86 | 0.75-0.98 | 0.019 |
| rs805304 | 6 | 31730311 | CLIC1, DDAH2 | 6p21.33 | 306/371/101 | 520/511/134 | 1.16 | 1.01-1.33 | 0.030 |
| rs707939 | 6 | 31758911 | MSH5 | 6p21.33 | 320/358/99 | 439/534/193 | 0.86 | 0.75-0.98 | 0.023 |
| rs1802127 | 6 | 31762148 | MSH5 | 6p21.33 | 732/46/0 | 1119/46/1 | 1.45 | 0.96-2.19 | 0.075 |
| rs13180 | 15 | 78497146 | IREB2 | 15q25.1 | 332/357/89 | 446/545/175 | 0.84 | 0.74-0.96 | 0.012 |
| rs4362358 | 15 | 78503762 | IREB2‖HYKK | 15q25.1 | 335/354/89 | 450/542/174 | 0.84 | 0.74-0.96 | 0.012 |
| rs8034191 | 15 | 78513681 | HYKK | 15q25.1 | 270/374/134 | 469/546/151 | 1.23 | 1.08-1.40 | 0.002 |
| rs4461039 | 15 | 78525105 | HYKK | 15q25.1 | 523/230/25 | 710/403/53 | 0.78 | 0.67-0.92 | 0.004 |
| rs16969968 | 15 | 78590583 | CHRNA5 | 15q25.1 | 276/378/124 | 471/545/149 | 1.19 | 1.04-1.36 | 0.010 |
| rs12914385 | 15 | 78606381 | CHRNA3 | 15q25.1 | 229/390/159 | 409/570/186 | 1.23 | 1.08-1.41 | 0.002 |
| rs12443170 | 15 | 78615394 | CHRNA3 | 15q25.1 | 641/126/11 | 877/265/18 | 0.71 | 0.58-0.87 | 0.001 |
| rs8042374 | 15 | 78615690 | CHRNA3 | 15q25.1 | 521/231/26 | 697/413/55 | 0.77 | 0.65-0.90 | 0.001 |
| rs12440014 | 15 | 78634384 | CHRNB4 | 15q25.1 | 519/232/27 | 682/414/69 | 0.73 | 0.62-0.85 | <0.001 |
| rs1316971 | 15 | 78638168 | CHRNB4 | 15q25.1 | 546/210/22 | 744/377/45 | 0.78 | 0.66-0.92 | 0.004 |
LEGEND: Chr – chromosome; OR – odds ratio per minor allele; CI – confidence interval.
a. Odds ratio, 95% confidence interval, and Ptrend results from logistic regression.
b. Based on dbSNP Human Build 142 (GRCh38).
c. Case and control counts: common allele homozygote / heterozygote / minor allele homozygote.
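To illustrate the Hardy-Weinberg quality control check applied to the genotyped SNPs, the following minimal Python sketch runs a 1-degree-of-freedom chi-square goodness-of-fit test on the control genotype counts listed for rs2736100 in Table 1. The choice of a chi-square test, the restriction to controls, and the function name `hwe_chi_square` are assumptions made for illustration; the text does not state which test or which subjects were used.

```python
import math


def hwe_chi_square(n_common_hom, n_het, n_minor_hom):
    """Return (chi-square, P) for a 1-df Hardy-Weinberg goodness-of-fit test.

    Illustrative helper (hypothetical name); not the authors' procedure.
    """
    n = n_common_hom + n_het + n_minor_hom
    q = (2 * n_minor_hom + n_het) / (2 * n)  # minor allele frequency
    p = 1.0 - q                              # common allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_common_hom, n_het, n_minor_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Control genotype counts for rs2736100 from Table 1:
# common homozygote / heterozygote / minor homozygote.
chi2, p_value = hwe_chi_square(292, 618, 256)
print(f"chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

Under these assumptions, a SNP would be flagged only when the resulting P falls below the 0.001 threshold quoted in the text.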
Statistical analysis
Using a data set that replaced missing genotypes with values imputed from haplotype frequencies estimated by expectation maximization,[30] we used stepwise logistic regression (Pentry < 0.25, Pstay < 0.15), one GWAS region at a time, to select SNPs independently associated with lung cancer risk. Using weights to balance the numbers of cases and controls and working one GWAS region at a time, we used a WEKA [31] implementation of the C4.5 algorithm [32] to translate the independent SNP variables into decision trees. To evaluate the effects of genetic variables on lung cancer risk prediction, we restricted analyses to the lung cancer cases nested in PLuSS, along with the comparable controls selected at random. We used Net Reclassification Improvement (NRI; [33]) to compare lung cancer predictions made by appropriately calibrated models with and without genetic variables. Base models estimated individual risks by means of logistic regression and a risk factor score that tallied points for the following four risk factors: duration of smoking, age, cigarettes per day, and smoking status (Supplemental Table 2). To recalibrate a prediction model, we reweighted case and control observations as desired and then re-estimated the intercept term in logistic regression with the logit of case-control model predictions entered as an offset. Statistical analyses were completed in SAS 9.4 (SAS Institute Inc., Cary, NC).
Results
Selected characteristics of the study population are presented in Table 2. Women were similarly represented in the case and control groups. The cases were significantly older than the controls, and had accrued more cigarette pack-years. The majority of the cases were diagnosed with non-small cell lung cancer; adenocarcinoma and squamous cell carcinoma were the most commonly observed histological types.
Table 2. Selected characteristics of the study population stratified by case-control status.
| | Cases | | Controls | | Pc |
|---|---|---|---|---|---|
| | N | % | N | % | |
| N | 778 | 100.0 | 1166 | 100.0 | |
| Sex | | | | | 0.85 |
| male | 405 | 52.1 | 602 | 51.6 | |
| female | 373 | 47.9 | 564 | 48.4 | |
| Age, in yearsa | | | | | <.0001 |
| 45-59 | 142 | 18.3 | 665 | 57.0 | |
| 60-69 | 295 | 37.9 | 392 | 33.6 | |
| ≥70 | 341 | 43.8 | 109 | 9.3 | |
| Pack-years | | | | | <0.01 |
| <35 | 183 | 23.5 | 292 | 25.0 | |
| 35-49 | 166 | 21.3 | 300 | 25.7 | |
| 50-64 | 186 | 23.9 | 298 | 25.6 | |
| ≥65 | 243 | 31.2 | 276 | 23.7 | |
| GOLDb | | | | | <.0001 |
| no limitation | 236 | 34.2 | 652 | 55.9 | |
| mild | 125 | 18.1 | 165 | 14.2 | |
| moderate | 238 | 34.5 | 265 | 22.7 | |
| severe | 91 | 13.2 | 84 | 7.2 | |
| Histology | | | | | |
| Adenocarcinoma | 350 | 45.0 | | | |
| Squamous cell | 271 | 34.8 | | | |
| Adenosquamous | 29 | 3.7 | | | |
| Large cell | 39 | 5.0 | | | |
| Non-small cell | 46 | 5.9 | | | |
| Small cell | 43 | 5.5 | | | |
LEGEND: GOLD – Global Initiative for Chronic Obstructive Lung Disease severity of airflow limitation.[43]
a. Age at diagnosis and age at PLuSS enrollment for cases and controls, respectively.
b. Eighty-eight cases were missing spirometry data that predated lung cancer diagnosis.
c. Differences in characteristics between cases and controls were assessed using Chi-square tests.
Stepwise logistic regression selected two, three, and two SNPs in 5p15.33 (rs2736100 and rs7727912), 6p21.33 (rs805297, rs2295663, and rs1802127), and 15q25.1 (rs8034191 and rs12440014), respectively (Table 3). The 5p15.33 decision-tree analysis produced a lung cancer class that included both the carriers of the rs7727912 minor allele and the homozygotes for the rs2736100 common allele (Supplemental Figure 1a). The 6p21.33 lung cancer class included both the carriers of the rs1802127 minor allele and the homozygotes for the rs805297 common allele (Supplemental Figure 1b). Finally, the 15q25.1 lung cancer class included rs12440014 common allele homozygotes with one or two copies of the rs8034191 minor allele (Supplemental Figure 1c). All subsequent results used these classification results from decision-tree analysis to distinguish high from low risk for lung cancer.
Table 3. SNPs identified by stepwise logistic regression, by GWAS region.
| rs number | OR | 95% CI | Ptrend |
|---|---|---|---|
| 5p15.33 | AUC 0.537 | ||
| rs2736100 | 0.90 | 0.79-1.03 | 0.13 |
| rs7727912 | 1.32 | 1.06-1.63 | 0.01 |
| 6p21.33 | AUC 0.540 | ||
| rs805297 | 0.86 | 0.75-0.98 | 0.03 |
| rs2295663 | 1.28 | 0.98-1.67 | 0.07 |
| rs1802127 | 1.41 | 0.93-2.13 | 0.11 |
| 15q25.1 | AUC 0.557 | ||
| rs8034191 | 1.13 | 0.98-1.31 | 0.08 |
| rs12440014 | 0.77 | 0.65-0.91 | <0.01 |
LEGEND: AUC – area under receiver operator characteristic curve
Control group characteristics according to decision-tree-determined risk at 5p15.33, 6p21.33 and 15q25.1 are presented in Supplemental Tables 3, 4 and 5, respectively. High-risk controls had more emphysema on CT (P=0.22, <0.01, and <0.01 for 5p15.33, 6p21.33, and 15q25.1, respectively). Among controls, 21%, 43%, and 36% satisfied decision-tree definitions of high risk at zero, only one, and more than one GWAS region, respectively. Cigarettes per day, pack-years, prevalence of doctor-diagnosed emphysema or bronchitis, and emphysema severity increased along with the number of GWAS regions that satisfied decision-tree definitions of high risk (Table 4).
Table 4. Characteristics of controls (N=1113) according to number of high-risk GWAS regionsa.
| | Zero | | One | | >One | | |
|---|---|---|---|---|---|---|---|
| | N | % | N | % | N | % | Pc |
| N | 231 | 100.0 | 478 | 100.0 | 404 | 100.0 | |
| Year | | | | | | | 0.36 |
| 2002 | 73 | 31.6 | 127 | 26.6 | 115 | 28.5 | |
| 2003 | 89 | 38.5 | 219 | 45.8 | 184 | 45.5 | |
| 2004-2005 | 69 | 29.9 | 132 | 27.6 | 105 | 26.0 | |
| Sex | | | | | | | 0.22 |
| male | 127 | 55.0 | 254 | 53.1 | 196 | 48.5 | |
| female | 104 | 45.0 | 224 | 46.9 | 208 | 51.5 | |
| Age, in years | | | | | | | 0.87 |
| 50-59 | 131 | 56.7 | 284 | 59.4 | 227 | 56.2 | |
| 60-69 | 79 | 34.2 | 151 | 31.6 | 136 | 33.7 | |
| 70-79 | 21 | 9.1 | 43 | 9.0 | 41 | 10.1 | |
| Family historyb | | | | | | | 1.00 |
| no | 192 | 83.8 | 398 | 84.0 | 337 | 83.8 | |
| yes | 37 | 16.2 | 76 | 16.0 | 65 | 16.2 | |
| Symptoms, N | | | | | | | 0.07 |
| 0 | 68 | 29.4 | 165 | 34.5 | 137 | 33.9 | |
| 1 | 67 | 29.0 | 148 | 31.0 | 104 | 25.7 | |
| 2 | 67 | 29.0 | 103 | 21.5 | 91 | 22.5 | |
| 3 | 29 | 12.6 | 62 | 13.0 | 72 | 17.8 | |
| Duration of smoking, in years | | | | | | | 0.73 |
| <30 | 10 | 4.3 | 34 | 7.1 | 30 | 7.4 | |
| 30-39 | 109 | 47.2 | 221 | 46.2 | 177 | 43.8 | |
| 40-49 | 83 | 35.9 | 172 | 36.0 | 147 | 36.4 | |
| ≥50 | 29 | 12.6 | 51 | 10.7 | 50 | 12.4 | |
| Cigarettes/day | | | | | | | 0.005 |
| <20 | 89 | 38.5 | 131 | 27.4 | 100 | 24.8 | |
| 20-29 | 88 | 38.1 | 214 | 44.8 | 168 | 41.6 | |
| 30-39 | 34 | 14.7 | 85 | 17.8 | 80 | 19.8 | |
| ≥40 | 20 | 8.7 | 48 | 10.0 | 56 | 13.9 | |
| Pack-years | | | | | | | 0.020 |
| <35 | 76 | 32.9 | 115 | 24.1 | 87 | 21.5 | |
| 35-49 | 52 | 22.5 | 134 | 28.0 | 101 | 25.0 | |
| 50-64 | 56 | 24.2 | 125 | 26.2 | 103 | 25.5 | |
| ≥65 | 47 | 20.3 | 104 | 21.8 | 113 | 28.0 | |
| COPD | | | | | | | 0.005 |
| no | 194 | 84.0 | 391 | 81.8 | 301 | 74.5 | |
| yes | 37 | 16.0 | 87 | 18.2 | 103 | 25.5 | |
| Emphysema | | | | | | | <.0001 |
| none | 137 | 59.3 | 280 | 58.6 | 199 | 49.3 | |
| trace | 53 | 22.9 | 96 | 20.1 | 68 | 16.8 | |
| mild | 29 | 12.6 | 57 | 11.9 | 72 | 17.8 | |
| mod-severe | 12 | 5.2 | 45 | 9.4 | 65 | 16.1 | |
| GOLD | | | | | | | 0.059 |
| no limitation | 140 | 60.6 | 267 | 55.9 | 214 | 53.0 | |
| mild | 36 | 15.6 | 72 | 15.1 | 48 | 11.9 | |
| moderate | 40 | 17.3 | 110 | 23.0 | 103 | 25.5 | |
| severe | 15 | 6.5 | 29 | 6.1 | 39 | 9.7 | |
LEGEND: Symptoms – symptom (cough, phlegm, wheeze) count; COPD – doctor diagnosis of emphysema or bronchitis; GOLD – Global Initiative for Chronic Obstructive Lung Disease severity of airflow limitation.[43]
a. Fifty-three subjects could not be classified because of missing genotype data.
b. First-degree relative with lung cancer (data missing in two, four, and two controls with zero, one, and >one GWAS risk marker).
c. Differences in characteristics between groups were assessed using Chi-square tests.
We tabulated cases and controls, first according to a three-GWAS-region cross classification of decision-tree-determined SNP configurations associated with high vs. low lung cancer risk and then according to a three-level classified count of GWAS regions showing the high-risk SNP configuration (Table 5). Table 5 shows, for example, a three-fold higher ratio of cases to controls among subjects with a high-risk genotype in every one as opposed to none of the three GWAS regions (odds ratio 3.14, 95% confidence interval 2.02-4.88, adjusted for sex, age, and cigarette pack-years). Unadjusted as well as adjusted (sex, age, and pack-years), lung cancer risk increased (P<0.001) in step with the number of GWAS regions that contained a SNP configuration associated with high lung cancer risk.
Table 5. Lung cancer cases and controls, classified according to SNP configuration in three GWAS regions.
| GWAS regions cross classified | | | All cases (N=766) and all controls (N=1113)a | | | | | | PLuSS case (N=168) and random control subsets (N=816)b | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15q25.1 | 6p21.33 | 5p15.33 | Case | Ctrl | OR | 95% CI | ORa | 95% CI | Case | Ctrl | OR | 95% CI | ORb | 95% CI |
| LO | LO | LO | 109 | 231 | Ref. | | Ref. | | 22 | 168 | Ref. | | Ref. | |
| LO | LO | HI | 86 | 118 | 1.54 | 1.08-2.21 | 1.57 | 1.04-2.36 | 23 | 86 | 2.04 | 1.08-3.87 | 1.90 | 0.98-3.71 |
| LO | HI | LO | 108 | 185 | 1.24 | 0.89-1.72 | 1.19 | 0.82-1.73 | 23 | 143 | 1.23 | 0.66-2.30 | 1.16 | 0.60-2.22 |
| LO | HI | HI | 80 | 122 | 1.39 | 0.97-2.00 | 1.33 | 0.88-2.01 | 21 | 87 | 1.84 | 0.96-3.54 | 1.56 | 0.79-3.07 |
| HI | LO | LO | 115 | 175 | 1.39 | 1.00-1.93 | 1.66 | 1.14-2.40 | 23 | 126 | 1.39 | 0.74-2.61 | 1.36 | 0.71-2.60 |
| HI | LO | HI | 83 | 100 | 1.76 | 1.22-2.55 | 1.84 | 1.21-2.80 | 19 | 71 | 2.04 | 1.04-4.01 | 1.95 | 0.97-3.95 |
| HI | HI | LO | 100 | 112 | 1.89 | 1.33-2.69 | 2.07 | 1.38-3.09 | 20 | 80 | 1.91 | 0.99-3.70 | 1.76 | 0.88-3.52 |
| HI | HI | HI | 85 | 70 | 2.57 | 1.74-3.80 | 3.14 | 2.02-4.88 | 17 | 55 | 2.36 | 1.17-4.76 | 2.48 | 1.19-5.18 |
| Countc | | | Case | Ctrl | OR | 95% CI | ORa | 95% CI | Case | Ctrl | OR | 95% CI | ORb | 95% CI |
| 0 | | | 109 | 231 | Ref. | | Ref. | | 22 | 168 | Ref. | | Ref. | |
| 1 | | | 309 | 478 | 1.37 | 1.05-1.79 | 1.45 | 1.07-1.96 | 69 | 355 | 1.48 | 0.89-2.48 | 1.41 | 0.83-2.41 |
| 2 | | | 263 | 334 | 1.67 | 1.26-2.21 | 1.72 | 1.25-2.36 | 60 | 238 | 1.93 | 1.14-3.26 | 1.74 | 1.01-3.01 |
| 3 | | | 85 | 70 | 2.57 | 1.74-3.80 | 3.14 | 2.02-4.88 | 17 | 55 | 2.36 | 1.17-4.76 | 2.48 | 1.19-5.18 |
| | | | | | | | | | | | | | | |
| 0 | | | 109 | 231 | Ref. | | Ref. | | 22 | 168 | Ref. | | Ref. | |
| 1 | | | 309 | 478 | 1.37 | 1.05-1.79 | 1.44 | 1.07-1.96 | 69 | 355 | 1.48 | 0.89-2.48 | 1.41 | 0.83-2.41 |
| ≥2 | | | 348 | 404 | 1.83 | 1.40-2.39 | 1.94 | 1.43-2.64 | 77 | 293 | 2.01 | 1.21-3.34 | 1.86 | 1.10-3.17 |
LEGEND: PLuSS – Pittsburgh Lung Screening Study; Ctrl – control; LO – decision-tree-determined SNP configuration associated with low lung cancer risk; HI – decision-tree-determined SNP configuration associated with high lung cancer risk; OR – odds ratio; ORa – odds ratio adjusted for sex, age (three groups), and pack-years (four groups); ORb – odds ratio adjusted for a risk factor score that tallied points for duration of smoking, age, cigarettes per day, and smoking status (Supplemental Table 2).
a. Twelve cases and 53 controls could not be classified because of missing genotype data.
b. Ten cases and 45 controls could not be classified because of missing genotype data.
c. Count of GWAS regions with SNP configurations associated with high lung cancer risk.
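The unadjusted odds ratios in Table 5 can be checked directly from the tabulated counts. The Python sketch below assumes a Wald (log-scale normal approximation) 95% confidence interval, which the text does not name explicitly, and uses the hypothetical helper `odds_ratio_wald`. It compares subjects with a high-risk configuration in all three regions (85 cases, 70 controls) against the reference group with none (109 cases, 231 controls).

```python
import math


def odds_ratio_wald(case_exp, ctrl_exp, case_ref, ctrl_ref, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    Illustrative helper (hypothetical name); assumes a log-scale normal
    approximation for the interval.
    """
    odds_ratio = (case_exp / ctrl_exp) / (case_ref / ctrl_ref)
    # Standard error of the log odds ratio from the four cell counts.
    se = math.sqrt(1 / case_exp + 1 / ctrl_exp + 1 / case_ref + 1 / ctrl_ref)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# HI/HI/HI versus LO/LO/LO, all cases and all controls (Table 5).
or_all, lo_all, hi_all = odds_ratio_wald(85, 70, 109, 231)
print(f"OR = {or_all:.2f} (95% CI {lo_all:.2f}-{hi_all:.2f})")
```

Under this assumption the result should agree, to two decimals, with the unadjusted entry in the last cross-classified row of Table 5. The adjusted estimates (ORa, ORb) come from logistic regression with covariates and cannot be reproduced from the counts alone.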
We subsequently used the prospective data from PLuSS to make externally valid inferences about the contributions of genetic variables to risk prediction. Associations between the GWAS variables and lung cancer were preserved, even after adjustment for a score that added together points for four risk factors: duration of smoking, age, cigarettes per day, and smoking status (Table 5; Supplemental Table 2). Adding a 3-level high-risk GWAS-region count variable (0, 1, and ≥2 high-risk GWAS regions) to the risk factor score-only model increased area under the receiver operator characteristic curve (AUC) by 0.008 units (0.717 to 0.725; P=0.056). We recalibrated both the risk factor score-only model and the risk factor score plus GWAS-region count variable model to four events per 100 (absolute 6-year lung cancer incidence observed in PLuSS; data not shown). Across low (<3.0%), intermediate (3.0-4.9%), and high (≥5.0%) 6-year risk categories, adding the GWAS-region count variable improved net reclassification among cases by 7.7% (100 · [(3.6 + 0.7 + 0.0) – (0.7 + 0.5 + 0.0)]/40.0) and worsened net reclassification among non-cases by 2.6% (100 · [(21.2 + 28.2 + 0.0) – (51.8 + 22.4 + 0.0)]/960.0), for an overall NRI of 0.052 (0.077 – 0.026, computed before rounding; Table 6).
Table 6. Theoretical cohort of 1000 persons with 4% average lung cancer risk, classified according to lung cancer outcome and lung cancer risk as predicted by risk factor-onlya and risk factor plus GWAS-region count variable models.
| | Risk factor plus GWAS prediction | Risk factor-only prediction | | | Total |
|---|---|---|---|---|---|
| | | ≥0.05 | ≥0.03 <0.05 | <0.03 | |
| Lung cancer | ≥0.05 | 20.2 | 3.6 | 0.0 | 23.8 |
| | ≥0.03 <0.05 | 0.7 | 6.4 | 0.7 | 7.9 |
| | <0.03 | 0.0 | 0.5 | 7.9 | 8.3 |
| | Total | 21.0 | 10.5 | 8.6 | 40.0 |
| No lung cancer | ≥0.05 | 198.8 | 51.8 | 0.0 | 250.6 |
| | ≥0.03 <0.05 | 21.2 | 154.1 | 22.4 | 197.6 |
| | <0.03 | 0.0 | 28.2 | 483.5 | 511.8 |
| | Total | 220.0 | 234.1 | 505.9 | 960.0 |
a. Predictions based on a risk factor score that tallied points for duration of smoking, age, cigarettes per day, and smoking status (Supplemental Table 2).
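The net reclassification arithmetic can be reproduced from the cells of Table 6. The Python sketch below is illustrative only: it assumes that table rows index the risk category assigned by the risk factor plus GWAS model and columns index the category assigned by the risk factor-only model, both ordered from high to low risk, and the names `moved` and `nri` are hypothetical.

```python
# Rows: category under the risk factor plus GWAS model (high, intermediate, low).
# Columns: category under the risk factor-only model (high, intermediate, low).
CASES = [
    [20.2, 3.6, 0.0],
    [0.7, 6.4, 0.7],
    [0.0, 0.5, 7.9],
]
NONCASES = [
    [198.8, 51.8, 0.0],
    [21.2, 154.1, 22.4],
    [0.0, 28.2, 483.5],
]


def moved(table):
    """Return (up, down): persons moved to a higher or lower risk category
    by the model with genetic information (index 0 is the highest risk)."""
    up = down = 0.0
    for new_idx, row in enumerate(table):
        for old_idx, count in enumerate(row):
            if new_idx < old_idx:
                up += count
            elif new_idx > old_idx:
                down += count
    return up, down


def nri(cases, noncases):
    """Category-based NRI: event component, non-event component, overall."""
    n_cases = sum(map(sum, cases))
    n_noncases = sum(map(sum, noncases))
    up_c, down_c = moved(cases)
    up_n, down_n = moved(noncases)
    nri_events = (up_c - down_c) / n_cases            # moving up is correct
    nri_nonevents = (down_n - up_n) / n_noncases      # moving down is correct
    return nri_events, nri_nonevents, nri_events + nri_nonevents


nri_events, nri_nonevents, nri_overall = nri(CASES, NONCASES)
print(f"events {nri_events:.3f}, non-events {nri_nonevents:.3f}, "
      f"overall {nri_overall:.3f}")
```

The sign convention makes the trade-off explicit: cases gain when moved up, non-cases gain when moved down, and the overall NRI is the sum of the two components.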
Discussion
Having genotyped 778 cases and 1166 controls for 77 SNPs located in regions identified by GWAS of lung cancer or COPD, our methods built on six SNPs, two from each of the following three regions: 5p15.33, 6p21.33, and 15q25.1. These three regions have withstood the test of time as the source of valid markers of susceptibility to lung cancer in white populations,[7] and contain cancer-relevant genes that include telomerase reverse transcriptase (TERT; essential for telomerase production and maintenance of telomeres) and CLPTM1-like (CLPTM1L; reported to protect against apoptosis [34]) in 5p15.33, BCL2-associated athanogene 6 (BAG6; required for p53-mediated response to DNA damage and stress) and mutS homolog 5 (MSH5; involved in DNA mismatch repair) in 6p21.33, and the nicotinic cholinergic receptor genes (CHRNA3, CHRNA5, CHRNB4; associated with nicotine addiction and smoking cessation) in 15q25.1. Using ancillary information provided by the lung cancer-free controls, our results shown in Table 4 support associations previously reported between GWAS-identified markers of lung cancer risk and other factors capable of mediating the lung cancer effects of these inherited genetic differences. Mediating factors relevant to our study included intensity of cigarette smoking, as measured by cigarettes per day, and COPD, as measured by self-reported doctor diagnosis, emphysema on CT, and airflow limitation on spirometry. Each GWAS region appeared to contribute independently to risk, a pattern discernible by comparing subjects with high-risk genotypes in only one as opposed to none of the three GWAS regions (odds ratio (95% confidence interval) for high GWAS risk at 5p15.33 only, 6p21.33 only, and 15q25.1 only, respectively, 1.54 (1.08-2.21), 1.24 (0.89-1.72), and 1.39 (1.00-1.93); Table 5). Statistical control for case-control differences in age and smoking did not diminish these lung cancer associations with genetic risk (Table 5).
By comparing extreme categories, a high-risk genotype in every one as opposed to none of the three GWAS regions, we explored the maximum extent to which GWAS SNP classifications may discriminate between persons with and without lung cancer risk. Six percent and twenty-one percent of 1113 controls had high-risk genotypes in three and zero GWAS regions, respectively, with the ratio of cases to controls significantly elevated three fold in the former as opposed to the latter category after adjustment for sex, age, and cigarette pack years. To illustrate the potential usefulness of the lung cancer discrimination achievable with GWAS SNP classification in practice, however, we formed three prudent categories defined by presence of a high-risk genotype in none, only one, and more than one GWAS region. Adding this high-risk GWAS-region count variable to an age and smoking risk factor-only model improved lung cancer prediction, as measured by an AUC change of 0.008 and an NRIoverall of 0.052 across low, intermediate, and high risk categories. These risk categories spanned the average 6-year lung cancer risk observed in PLuSS, a representative cohort of current and former smokers.[28] To discuss these improvements in possibly more meaningful terms, we will use our theoretical cohort of 1000 persons (Table 6) to describe reclassification based on number of high-risk GWAS regions between low and intermediate risk categories, categories sensibly judged to define levels of lung cancer risk insufficient and sufficient for CT screening. Number of high-risk GWAS regions reclassified 0.7 cases from low to intermediate risk and 0.5 cases from intermediate to low risk, changes associated with a 0.5 percentage point improvement in the sensitivity of prediction (100 · (0.7 – 0.5)/40). 
The count of high-risk GWAS regions also reclassified 28.2 controls from intermediate to low risk and 22.4 controls from low to intermediate risk, changes associated with a 0.6 percentage point improvement in the specificity of prediction (100 · (28.2 – 22.4)/960). Even for persons with genuinely borderline risk profiles, these improvements in sensitivity and specificity may not offset genotyping costs.
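The sensitivity and specificity changes at the low versus intermediate boundary follow from the same theoretical cohort of 1000 persons. A minimal Python sketch, using the reclassification counts quoted in the text and a hypothetical helper name `threshold_shift`, treats a predicted risk of at least 3% as a positive (screen-eligible) prediction; that reading of the threshold is an assumption for illustration.

```python
def threshold_shift(cases_up, cases_down, controls_down, controls_up,
                    n_cases, n_controls):
    """Percentage point change in sensitivity and specificity at one risk
    threshold, given counts reclassified across it by the new model."""
    # Cases moved above the threshold are newly detected; those moved below are lost.
    delta_sensitivity = 100.0 * (cases_up - cases_down) / n_cases
    # Controls moved below the threshold are newly spared; those moved above are not.
    delta_specificity = 100.0 * (controls_down - controls_up) / n_controls
    return delta_sensitivity, delta_specificity


# Counts crossing the 3% boundary in the theoretical cohort (Table 6).
d_sens, d_spec = threshold_shift(
    cases_up=0.7, cases_down=0.5,
    controls_down=28.2, controls_up=22.4,
    n_cases=40.0, n_controls=960.0,
)
print(f"sensitivity change: {d_sens:.1f} points, "
      f"specificity change: {d_spec:.1f} points")
```

Both changes are well under one percentage point, which is the basis for doubting that they would offset genotyping costs.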
Using PLuSS, an illustrative cohort of current and former cigarette smokers,[28] we evaluated SNPs in GWAS-identified regions as predictors of lung cancer risk, an effort only possible with prospective data. Several previous studies have measured the improvements achieved by adding SNP genotype information to an otherwise conventional lung cancer prediction model.[35-39] However, none used prospective data to systematically evaluate SNPs from GWAS regions. Though unique in these regards, our study faced several limitations. First, the numbers of cases and controls available, particularly the number of cases nested in PLuSS, were relatively small, which limited the precision of some of our estimates of association between genetic risk and lung cancer (Table 5). Second, having used statistical criteria to select six SNPs from a list of 74 candidates, we did not correct our estimates of prediction accuracy for model overfitting. Therefore, the improvements in prediction attributed to genetic variables are optimistic, potentially representing only an upper bound of the benefits that might be possible. Third, not all cases and controls came from a single well-defined source. To help select SNPs and build models, we supplemented the cases nested in PLuSS with cases seen in the clinic and added subjects with one or more benign nodules to the controls randomly selected from PLuSS. We did not, however, observe statistically significant differences in genotype frequencies between cases obtained from the clinic as opposed to PLuSS or between controls selected randomly as opposed to intentionally (data not shown). Fourth, we based estimates of net reclassification improvement on notions about current and former smokers who may or may not be appropriate candidates for lung cancer screening with low-dose CT.
No universally accepted risk threshold exists for separating smokers between these two groups.[9] However, when divided at the 0.03 risk threshold by the risk-factor only model used for NRI calculation (Table 6), controls randomly selected from PLuSS split between a lower (N=450) and higher (N=411) risk group, two thirds in the former failing and nearly all (93%) in the latter satisfying minimum age and pack-year requirements for lung cancer screening, as established by the U.S. Preventive Services Task Force.[40] Finally, our base models considered a limited set of risk factors related to sex, age, or smoking. Approaches that consider additional risk factors, such as COPD, may reduce even further the incremental improvements in prediction possible through genetic information.
In conclusion, in line with what has been reported for other cancer sites,[41,42] adding genetic information from six SNPs located in genomic regions identified by GWAS to basic models that included critical risk factors related to age and smoking resulted in only small improvements in lung cancer risk prediction. The improvements observed are, unfortunately, likely too small to affect disease control practice.
Supplementary Material
Acknowledgments
Financial support: This research was supported by National Institutes of Health grants P50 CA90440 and P30 CA047904.
References
- 1. Amos CI, Wu X, Broderick P, et al. Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat Genet. 2008;40:616–22. doi: 10.1038/ng.109.
- 2. McKay JD, Hung RJ, Gaborieau V, et al. Lung cancer susceptibility locus at 5p15.33. Nat Genet. 2008;40:1404–6. doi: 10.1038/ng.254.
- 3. Wang Y, Broderick P, Webb E, et al. Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nat Genet. 2008;40:1407–9. doi: 10.1038/ng.273.
- 4. Broderick P, Wang Y, Vijayakrishnan J, et al. Deciphering the impact of common genetic variation on lung cancer risk: A genome-wide association study. Cancer Res. 2009;69:6633–41. doi: 10.1158/0008-5472.CAN-09-0680.
- 5. Landi MT, Chatterjee N, Yu K, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet. 2009;85:679–91. doi: 10.1016/j.ajhg.2009.09.012.
- 6. Nguyen JDU, Lamontagne M, Couture C, et al. Susceptibility loci for lung cancer are associated with mRNA levels of nearby genes in the lung. Carcinogenesis. 2014;35:2653–9. doi: 10.1093/carcin/bgu184.
- 7. Yang IA, Holloway JW, Fong KM. Genetic susceptibility to lung cancer and co-morbidities. J Thorac Dis. 2013;5(Suppl 5):S454–62. doi: 10.3978/j.issn.2072-1439.2013.08.06.
- 8. Alberg AJ, Brock MV, Ford JG, Samet JM, Spivack SD. Epidemiology of lung cancer: Diagnosis and management of lung cancer, 3rd ed: American College of Chest Physicians evidence-based clinical practice guidelines. Chest. 2013;143:e1S–29S. doi: 10.1378/chest.12-2345.
- 9. Field JK, Chen Y, Marcus MW, McRonald FE, Raji OY, Duffy SW. The contribution of risk prediction models to early detection of lung cancer. J Surg Oncol. 2013;108:304–11. doi: 10.1002/jso.23384.
- 10. Spitz MR, Hong WK, Amos CI, et al. A risk model for prediction of lung cancer. J Natl Cancer Inst. 2007;99:715–26. doi: 10.1093/jnci/djk153.
- 11. Cassidy A, Myles JP, van Tongeren M, et al. The LLP risk model: an individual risk prediction model for lung cancer. Br J Cancer. 2008;98:270–6. doi: 10.1038/sj.bjc.6604158.
- 12. Maisonneuve P, Bagnardi V, Bellomi M, et al. Lung cancer risk prediction to select smokers for screening CT--a model based on the Italian COSMOS trial. Cancer Prev Res. 2011;4:1778–89. doi: 10.1158/1940-6207.CAPR-11-0026.
- 13. Tammemagi CM, Pinsky PF, Caporaso NE, et al. Lung cancer risk prediction: Prostate, Lung, Colorectal And Ovarian Cancer Screening Trial models and validation. J Natl Cancer Inst. 2011;103:1058–68. doi: 10.1093/jnci/djr173.
- 14. Kovalchik SA, Tammemagi M, Berg CD, et al. Targeting of low-dose CT screening according to the risk of lung-cancer death. N Engl J Med. 2013;369:245–54. doi: 10.1056/NEJMoa1301851.
- 15. Tammemagi MC, Katki HA, Hocking WG, et al. Selection criteria for lung-cancer screening. N Engl J Med. 2013;368:728–36. doi: 10.1056/NEJMoa1211776.
- 16. Tammemagi MC, Lam S. Screening for lung cancer using low dose computed tomography. BMJ. 2014;348:g2253. doi: 10.1136/bmj.g2253.
- 17. Proposed decision memo for screening for lung cancer with low dose computed tomography. [Accessed January 19, 2015];Centers for Medicare & Medicaid Services. 2014 at http://www.cms.gov/medicare-coverage-database/details/nca-proposed-decision-memo.aspx?NCAId=274.
- 18. Cho MH, Boutaoui N, Klanderman BJ, et al. Variants in FAM13A are associated with chronic obstructive pulmonary disease. Nat Genet. 2010;42:200–2. doi: 10.1038/ng.535.
- 19. Hancock DB, Eijgelsheim M, Wilk JB, et al. Meta-analyses of genome-wide association studies identify multiple loci associated with pulmonary function. Nat Genet. 2010;42:45–52. doi: 10.1038/ng.500.
- 20. Hung RJ, McKay JD, Gaborieau V, et al. A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature. 2008;452:633–7. doi: 10.1038/nature06885.
- 21. Li Y, Sheu CC, Ye Y, et al. Genetic variants and risk of lung cancer in never smokers: A genome-wide association study. Lancet Oncol. 2010;11:321–30. doi: 10.1016/S1470-2045(10)70042-5.
- 22. Liu P, Vikis HG, Wang D, et al. Familial aggregation of common sequence variants on 15q24-25.1 in lung cancer. J Natl Cancer Inst. 2008;100:1326–30. doi: 10.1093/jnci/djn268.
- 23. Pillai SG, Ge D, Zhu G, et al. A genome-wide association study in chronic obstructive pulmonary disease (COPD): Identification of two major susceptibility loci. PLoS Genet. 2009;5:e1000421. doi: 10.1371/journal.pgen.1000421.
- 24. Rafnar T, Sulem P, Besenbacher S, et al. Genome-wide significant association between a sequence variant at 15q15.2 and lung cancer risk. Cancer Res. 2011;71:1356–61. doi: 10.1158/0008-5472.CAN-10-2852.
- 25. Repapi E, Sayers I, Wain LV, et al. Genome-wide association study identifies five loci associated with lung function. Nat Genet. 2010;42:36–44. doi: 10.1038/ng.501.
- 26. Wilk JB, Chen TH, Gottlieb DJ, et al. A genome-wide association study of pulmonary function measures in the Framingham Heart Study. PLoS Genet. 2009;5:e1000429. doi: 10.1371/journal.pgen.1000429.
- 27. Young RP, Hopkins RJ, Whittington CF, Hay BA, Epton MJ, Gamble GD. Individual and cumulative effects of GWAS susceptibility loci in lung cancer: Associations after sub-phenotyping for COPD. PLoS One. 2011;6:e16476. doi: 10.1371/journal.pone.0016476.
- 28. Wilson DO, Weissfeld JL, Fuhrman CR, et al. The Pittsburgh Lung Screening Study (PLuSS): Outcomes within 3 years of a first computed tomography scan. Am J Respir Crit Care Med. 2008;178:956–61. doi: 10.1164/rccm.200802-336OC.
- 29. Wilson DO, Weissfeld JL, Balkan A, et al. Association of radiographic emphysema and airflow obstruction with lung cancer. Am J Respir Crit Care Med. 2008;178:738–44. doi: 10.1164/rccm.200803-435OC.
- 30. Yu Z, Schaid DJ. Methods to impute missing genotypes for population data. Hum Genet. 2007;122:495–504. doi: 10.1007/s00439-007-0427-y.
- 31. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: An update. SIGKDD Explorations. 2009;11.
- 32. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1992.
- 33. Leening MJ, Vedder MM, Witteman JC, Pencina MJ, Steyerberg EW. Net reclassification improvement: computation, interpretation, and controversies: A literature review and clinician's guide. Ann Intern Med. 2014;160:122–31. doi: 10.7326/M13-1522.
- 34. James MA, Wen W, Wang Y, Byers LA, Heymach JV, Coombes KR, Girard L, Minna J, You M. Functional Characterization of CLPTM1L as a Lung Cancer Risk Candidate Gene in the 5p15.33 Locus. PLoS One. 2012;7(6):e36116. doi: 10.1371/journal.pone.0036116.
- 35. Young RP, Hopkins RJ, Hay BA, et al. A gene-based risk score for lung cancer susceptibility in smokers and ex-smokers. Postgrad Med J. 2009;85:515–24. doi: 10.1136/pgmj.2008.077107.
- 36. Raji OY, Agbaje OF, Duffy SW, Cassidy A, Field JK. Incorporation of a genetic factor into an epidemiologic model for prediction of individual risk of lung cancer: the Liverpool Lung Project. Cancer Prev Res. 2010;3:664–9. doi: 10.1158/1940-6207.CAPR-09-0141.
- 37. Buch SC, Diergaarde B, Nukui T, et al. Genetic variability in DNA repair and cell cycle control pathway genes and risk of smoking-related lung cancer. Mol Carcinog. 2012;51(Suppl 1):E11–20. doi: 10.1002/mc.20858.
- 38. Li H, Yang L, Zhao X, et al. Prediction of lung cancer risk in a Chinese population using a multifactorial genetic model. BMC Med Genet. 2012;13:118. doi: 10.1186/1471-2350-13-118.
- 39. Spitz MR, Amos CI, Land S, et al. Role of selected genetic variants in lung cancer risk in African Americans. J Thorac Oncol. 2013 Apr;8(4):391–7. doi: 10.1097/JTO.0b013e318283da29.
- 40. Lung Cancer: Screening. U.S. Preventive Services Task Force. [Accessed December 11, 2014];2013 at http://www.uspreventiveservicestaskforce.org/Page/Topic/recommendation-summary/lung-cancer-screening.
- 41. Gail MH. Discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer risk. J Natl Cancer Inst. 2008;100:1037–41. doi: 10.1093/jnci/djn180.
- 42. Wacholder S, Hartge P, Prentice R, et al. Performance of common genetic variants in breast-cancer risk models. N Engl J Med. 2010;362:986–93. doi: 10.1056/NEJMoa0907727.
- 43. COPD Diagnosis and Management At-A-Glance Desk Reference. [Accessed December 9, 2014];2014 at http://www.goldcopd.org/guidelines-copd-diagnosis-and-management.html.
