Abstract
Background:
Self-reported smoking may not fully capture individualized risk of smoking-related cancer. Circulating proteins may reflect biological consequences of smoking. Thus, we developed a score from smoking-related proteins and evaluated its association with smoking-related cancer.
Methods:
This prospective cohort study included 10,563 participants aged 47-70 years in the Atherosclerosis Risk in Communities study. Plasma proteins were measured by SomaScan. The score was constructed from proteins associated with current smoking, packyears, and/or recent quitting identified by linear regression and elastic net regression. Cox regression was used to estimate adjusted hazard ratios (aHR) and 95% confidence intervals (CI). We confirmed the association in a case-cohort study in the European Prospective Investigation into Cancer and Nutrition (EPIC).
Results:
aHRs comparing score quartiles Q4 to Q1 for total incidence and mortality of 13 smoking-related cancers were 3.89 (95% CI 3.06-4.96) and 5.73 (95% CI 4.08-8.06) before, and 2.28 (95% CI 1.65-3.15) and 2.67 (95% CI 1.74-4.10) after adjusting for self-reported smoking. aHRs for lung cancer incidence and mortality were 12.1 (95% CI 7.11-20.6) and 14.2 (95% CI 7.58-26.8) before, and 3.04 (95% CI 1.59-5.81) and 4.12 (95% CI 1.99-8.53) after adjusting. In EPIC, aHRs for lung cancer were 9.47 (95% CI 6.82-13.15) before and 2.23 (95% CI 1.48-3.35) after adjusting.
Conclusion:
The smoking-related protein score provided relative risk information for smoking-associated cancers beyond self-reported smoking, which was confirmed in an independent cohort. Such a score may be considered for use in risk stratification for prevention and cancer screening in settings in which detailed smoking history cannot be obtained.
Keywords: smoking, lung cancer, proteomics, risk
Introduction
Smoking causes lung cancer and 12 other types of cancer.1,2 Self-report is the standard approach to assess smoking history, but can be inaccurate.2-4 People with the same self-reported smoking history may have different risks of smoking-related cancers due to differences in smoking behaviors, such as depth and intensity of inhalation. Moreover, the frequency, intensity, and duration that an individual reports to have smoked may not perfectly correlate with the internal dose or the biologically effective dose2,5 due to interindividual differences in the absorption and metabolism of tobacco constituents. Many tobacco carcinogens need to be first transformed and metabolized into reactive intermediates, which may bind to macromolecules, such as DNA and proteins.2 These alterations could affect production and function of these macromolecules. Plasma proteins may provide one such measure of biologically effective dose, and thus risk of cancer development and progression.
Studies have reported associations between smoking and circulating concentrations of numerous proteins using large-scale proteomics.6,7 To our knowledge, no studies have assessed detailed smoking history in relation to large-scale proteomics to generate a smoking-related protein score. Further, no large, prospective cohort study has investigated the association of such a score with risk of cancers known to be caused by smoking. Hence, there is a need to identify a parsimonious set of proteins that are associated with detailed smoking history, generate a score, and evaluate the association of the score with risk of smoking-related cancer, including cancer with lethal potential. If associated with smoking-related cancers, such a score could be evaluated for utility in risk stratification for personalized cancer prevention and screening recommendations.
For lung cancer, improved and equitable risk stratification for prevention and screening has been called for.8 The National Lung Screening Trial documented the benefit of low-dose computed tomography (CT) in reducing lung cancer mortality.9 The US Preventive Services Task Force recommends that persons 50-80 years old who smoked ≥20 packyears and either still smoke or quit within 15 years should receive annual low-dose CT for lung cancer screening.10 However, this recommendation does not consider interindividual variability in the biologically effective dose and thus potential differences in risk among persons with the same smoking history. In addition, the fact that not all smokers develop lung cancer suggests that some screen-eligible smokers under current guidelines would not benefit from screening but are at risk for experiencing screening-associated harms. Thus, an enhanced risk stratification tool is needed to tailor whether and how often to screen, to achieve the greatest benefit for those at highest risk and the lowest harm for those at lowest risk.
Thus, we developed a smoking-related plasma protein score, and assessed its association with incidence and mortality from smoking-related cancers before and after adjusting for self-reported smoking information in a prospective cohort study in the Atherosclerosis Risk in Communities (ARIC) study.11,12 We hypothesized that a combination of smoking-related proteins provides biological information about cancer risk beyond self-reported smoking. In the European Prospective Investigation into Cancer and Nutrition (EPIC),13,14 we sought to confirm the score’s association with lung cancer risk. If further confirmed, we expect that this score could better inform risk stratification for prevention and screening for lung cancer, and possibly other smoking-related cancers.
Methods
Study participants
ARIC recruited 15,792 mostly Black and White male and female participants aged 45-64 years during 1987-1989 from Jackson, MS; Washington County, MD; Minneapolis, MN; and Forsyth County, NC. Participants were invited to multiple follow-up visits after baseline. Lifestyle, medical factors and blood samples were collected.11 Participants provided informed consent. Study protocols were approved by the Institutional Review Boards of participating centers. After exclusions (Figure S1), we divided participants evenly (50/50) into Set 1 for score development and Set 2 for evaluating the association between the score and outcomes.
Protein measurements
Proteomic profiling was done using a DNA-aptamer based method (SomaScan® 5K) in plasma from Visits 2, 3 and 5.15 The method is highly sensitive and reliable (median CV=1.2%). Measures were normalized with pooled healthy control samples and calibrated with standard processes.16 Visit 2 protein levels passing quality control as described previously16 were used for the main analysis. Visit 3 and 5 protein levels were used in the internal confirmation analysis.
Step 1: Identification of smoking-related proteins and score construction
In Set 1, using linear regression adjusted for age, sex, and race-study center, we evaluated associations between log2-transformed levels of 4,697 proteins (outcome) detected by 4,877 aptamers and each smoking variable (exposure) in a separate model: status (a term for current, a term for former, with never smokers as the reference), cumulative exposure (continuous, packyears), and time since quitting (a term for quitting within 3 years, a term for quitting more than 3 years ago, with never smokers as the reference). For proteins passing Bonferroni correction (0.05/4,877) for current smoking (vs never), packyears, and recent quitting (vs never), we next ran elastic net regression adjusting for sex, age, and race-study center to remove highly correlated proteins. We repeated this model with resampling 100 times. Only proteins identified in ≥90 out of 100 iterations were considered. Because fewer proteins were related to recent quitting, for that variable we included proteins that appeared in ≥50 out of 100 iterations. Proteins associated with >1 smoking variable were counted only once in the score construction.
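The selection thresholds in this step can be illustrated with a short Python sketch. It does not fit the linear or elastic net regressions; it assumes each of the 100 resampled elastic net fits has already been reduced to the set of proteins with non-zero coefficients, which is one reading of "identified" and is not stated in the text. The function names, the inputs, and the strict inequality used for the Bonferroni threshold are hypothetical choices made for illustration.

```python
from collections import Counter

N_APTAMERS = 4877                      # number of tests, from the text
BONFERRONI_ALPHA = 0.05 / N_APTAMERS   # about 1.03e-5


def passes_bonferroni(p_value):
    """True if a per-aptamer p-value is below the Bonferroni threshold.

    Assumption: a strict inequality; the text does not specify.
    """
    return p_value < BONFERRONI_ALPHA


def stable_proteins(selected_per_iteration, min_count):
    """Keep proteins selected in at least `min_count` resampling iterations.

    `selected_per_iteration` is a list of sets, one per elastic net fit,
    each holding the proteins retained in that fit (hypothetical input).
    """
    counts = Counter(p for selected in selected_per_iteration for p in selected)
    return {protein for protein, n in counts.items() if n >= min_count}


def score_proteins(status_runs, packyear_runs, quit_runs):
    """Union of stable proteins across the three smoking variables.

    Thresholds follow the text: >=90/100 for smoking status and packyears,
    >=50/100 for recent quitting. Because the result is a set, a protein
    tied to more than one smoking variable is counted once.
    """
    return (
        stable_proteins(status_runs, 90)
        | stable_proteins(packyear_runs, 90)
        | stable_proteins(quit_runs, 50)
    )
```

Returning a set makes the "count each protein once" rule automatic rather than a separate de-duplication step.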
We constructed the score as follows: for proteins positively associated with smoking (i.e., higher protein level associated with current smoking, higher packyears, or recent quitting) participants with levels ≥median were assigned the weight of 1; for proteins inversely associated with smoking, those with levels <median were assigned the weight of 1; otherwise, 0. The final score was the sum across the weights from all selected proteins.
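The scoring rule above can be sketched in Python as follows. This is a minimal illustration, not the study code: the protein names and values are invented, and the choice to compute medians from the same small cohort is an assumption, since the text does not state which sample supplied the medians.

```python
from statistics import median


def reference_medians(cohort_levels, proteins):
    """Per-protein median across participants (a list of dicts)."""
    return {p: median(person[p] for person in cohort_levels) for p in proteins}


def protein_score(levels, direction, medians):
    """Sum of 0/1 weights over the selected proteins for one participant.

    levels    : dict protein -> this participant's protein level
    direction : dict protein -> +1 (positively associated with smoking)
                or -1 (inversely associated with smoking)
    medians   : dict protein -> reference median for that protein
    """
    score = 0
    for protein, sign in direction.items():
        level = levels[protein]
        if sign > 0 and level >= medians[protein]:
            score += 1      # positive protein at or above the median
        elif sign < 0 and level < medians[protein]:
            score += 1      # inverse protein below the median
    return score


# Toy example with two hypothetical proteins and three participants.
cohort = [
    {"P_UP": 1.0, "P_DOWN": 5.0},
    {"P_UP": 2.0, "P_DOWN": 4.0},
    {"P_UP": 3.0, "P_DOWN": 3.0},
]
direction = {"P_UP": +1, "P_DOWN": -1}
medians = reference_medians(cohort, direction)
scores = [protein_score(person, direction, medians) for person in cohort]
# scores == [0, 1, 2]
```

Because each protein contributes 0 or 1, the score is bounded by the number of selected proteins and does not depend on the size of each protein's association with smoking.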
Step 2: Association analysis
We considered incidence and mortality from 13 smoking-associated cancers combined: lung/bronchus, larynx, head and neck, esophagus, bladder, kidney, liver, acute myeloid leukemia, stomach, pancreas, colon, rectum, and cervix.2 We also considered lung cancer separately. Incident cancers were ascertained from state cancer registries supplemented with medical records and cancer mortality from death certificates12 through 2015. Among 5,282 participants, we ascertained 672 and 424 smoking-associated cancer cases and deaths, respectively. Of these, 233, 141, and 210 were lung cancer, non-small cell lung cancer (NSCLC) and deaths from lung cancer, respectively. Because the score was developed agnostic to outcomes, to confirm its broad applicability, we also considered all-cause mortality (n=2,699) and myocardial infarction and/or fatal coronary heart disease17 (MI/fatal CHD, n=783).
Covariates were demographics (age at blood collection, race-study center, sex); cancer risk or protective factors and/or factors that might influence circulating protein levels (post-menopausal hormone use, alcohol drinking, body mass index (BMI), diabetes [diagnosed diabetes defined as self-reported physician diagnosis or used diabetes drugs, undiagnosed diabetes defined as those without diagnosed diabetes having fasting glucose ≥126 mg/dL or non-fasting glucose ≥200 mg/dL, at risk for diabetes defined as having fasting glucose 100 to <126 mg/dL or non-fasting glucose 140 to <200 mg/dL, normoglycemic], hypertension medication use, cholesterol status [treated high cholesterol defined as used cholesterol-lowering drugs, undiagnosed-high cholesterol defined as total cholesterol >6.18 mmol/L, undiagnosed-borderline high cholesterol defined as total cholesterol 5.18-6.18 mmol/L, and normal cholesterol], eGFR). We also considered self-reported smoking information (current, former [both versus never], packyears) and DNA methylation-predicted packyears18,19.
In Set 2, Cox proportional hazards regression was used to estimate adjusted hazard ratios (aHR) and 95% confidence intervals (CI) for smoking-related cancer incidence and mortality, lung cancer incidence and mortality, NSCLC incidence, non-lung smoking-associated cancer incidence and mortality, MI/fatal CHD and all-cause mortality. We ran the following: Model 1: score (in quartiles, with cutpoints defined in Set 2) + demographic factors; Model 2: Model 1 + cancer risk or protective factors and/or factors that might influence circulating protein levels; Model 3: Model 1 + self-reported smoking information; Model 4: Model 2 + self-reported smoking information; Model 5: Model 4 + DNA methylation-predicted packyears (among 1438 participants with methylation data). We also ran Models 3 and 4 including terms for former-quit >3 years ago and former-quit ≤3 years ago instead of a former smoker term; associations for the score were comparable, so we included only the former smoker term. We repeated analyses within strata of race (Black, White), sex (female, male), and smoking status (current, former, ever, never). We assessed the discrimination of the score (C-statistic20) before and after including self-reported smoking information.
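To make the model nesting and the quartile coding concrete, the following Python sketch lists the covariate sets for Models 1-5 and assigns score quartiles from cutpoints computed in the analysis sample. It does not fit a Cox model. The column names are hypothetical, and both the interpolation method for the cutpoints and the handling of scores equal to a cutpoint are assumptions not specified in the text.

```python
from statistics import quantiles

# Hypothetical column names for the covariate groups described above.
DEMOGRAPHICS = ["age", "sex", "race_center"]
RISK_FACTORS = [
    "hormone_use", "alcohol", "bmi", "diabetes",
    "hypertension_meds", "cholesterol", "egfr",
]
SMOKING = ["smoking_status", "packyears"]
METHYLATION = ["methylation_packyears"]

# Each model extends an earlier one, mirroring the text.
MODELS = {"Model 1": ["score_quartile"] + DEMOGRAPHICS}
MODELS["Model 2"] = MODELS["Model 1"] + RISK_FACTORS
MODELS["Model 3"] = MODELS["Model 1"] + SMOKING
MODELS["Model 4"] = MODELS["Model 2"] + SMOKING
MODELS["Model 5"] = MODELS["Model 4"] + METHYLATION


def quartile_labels(scores):
    """Assign Q1-Q4 using cutpoints computed from `scores` themselves.

    Assumptions: "inclusive" interpolation for the cutpoints, and a score
    equal to a cutpoint is placed in the lower quartile.
    """
    c1, c2, c3 = quantiles(scores, n=4, method="inclusive")
    labels = []
    for s in scores:
        if s <= c1:
            labels.append("Q1")
        elif s <= c2:
            labels.append("Q2")
        elif s <= c3:
            labels.append("Q3")
        else:
            labels.append("Q4")
    return labels
```

With an integer-valued score, many participants can share a value at a cutpoint, so the tie rule chosen here can make the quartiles unequal in size.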
To internally confirm the associations between the smoking-related proteins and smoking-related cancer incidence and mortality, we recalculated the score with proteins measured at Visit 3 (3 years later) and Visit 5 (21 years later) and repeated the association analyses.
Statistical analyses were conducted using R Project for Statistical Computing version 4.2.3. Hypothesis tests were 2-sided with α=0.05.
Confirmation cohort
In EPIC,13,14 we analyzed case-cohort data consisting of 545 incident lung cancer cases and 4,115 subcohort non-cases from the United Kingdom, The Netherlands, Spain, and Italy. Median follow-up was 17.6 years. Local ethical committees in participating countries and the IARC ethical committee approved the study. All participants provided written informed consent. The score was calculated from plasma proteins measured using the SomaScan® 7K Assay, which includes all proteins measured in ARIC. HRs and 95% CIs were estimated for score quartiles using age as the time scale, adjusting for sex and country, and further adjusting for self-reported smoking information.
Results
Participant characteristics
In Set 2 (n=5,282), mean age was 57 years (SD: 5.73), 45% were male, 24% were Black, and median follow-up was 22.9 years. 2,117 (40%) were self-reported never smokers, 1,179 (22%) were current smokers, and 1,986 (38%) were former smokers (Table 1). Age, race-center, sex, alcohol drinking, BMI, diabetes, hypertension medication use, cholesterol-lowering medication use, and eGFR were significantly different by smoking status. Ever smokers were more likely to be male, ever drinkers, have higher eGFR, and were less likely to be obese. Baseline characteristics were comparable in Set 1 (n=5,281).
Table 1.
Characteristics by self-reported smoking status, Set 2¹, ARIC, 1990-1992.
| Characteristic | Overall, N = 5,282 | Never Smokers, N = 2,117 | Former Smokers, N = 1,986 | Current Smokers, N = 1,179 | P-value² |
|---|---|---|---|---|---|
| Age, years (median, Q1-Q3) | 57.0 (52.4, 62.2) | 57.0 (52.4, 62.2) | 57.6 (52.8, 62.8) | 56.0 (51.8, 61.1) | <.001 |
| Race-Study Center | | | | | <.001 |
| Black-Forsyth Co., NC | 2.7% | 2.4% | 1.5% | 5.3% | |
| Black-Jackson, MS | 21% | 25% | 17% | 23% | |
| White-Forsyth Co., NC | 24% | 22% | 24% | 27% | |
| White-Minnesota, MN | 27% | 24% | 32% | 23% | |
| White-Washington Co., MD | 26% | 27% | 27% | 21% | |
| Sex | | | | | <.001 |
| Female | 55% | 72% | 41% | 49% | |
| Male | 45% | 28% | 59% | 51% | |
| Post-menopausal Hormone Use (Women only) | | | | | <.001 |
| Never | 39% | 42% | 36% | 38% | |
| Ever | 61% | 58% | 64% | 62% | |
| Packyear Category | | | | | <.001 |
| Never Smoker | 40% | 100% | 0% | 0% | |
| >0 to 10 | 17% | 0% | 39% | 9.5% | |
| >10 to 20 | 11% | 0% | 20% | 14% | |
| >20 to 30 | 10% | 0% | 16% | 18% | |
| >30 | 22% | 0% | 25% | 58% | |
| Alcohol Drinking Status | | | | | <.001 |
| Never | 23% | 41% | 11% | 12% | |
| Former | 57% | 45% | 65% | 64% | |
| Current | 20% | 14% | 25% | 24% | |
| BMI Category | | | | | <.001 |
| <25 kg/m² | 31% | 29% | 27% | 42% | |
| 25 to <30 kg/m² | 40% | 37% | 42% | 39% | |
| ≥30 kg/m² | 29% | 34% | 30% | 19% | |
| Diabetes Status | | | | | <.001 |
| No diabetes | 35% | 39% | 33% | 32% | |
| At risk for diabetes² | 48% | 44% | 49% | 53% | |
| Undiagnosed diabetes² | 7.2% | 6.8% | 8.0% | 6.7% | |
| Diagnosed diabetes² | 9.4% | 10% | 9.3% | 8.1% | |
| Hypertension Medication Use | 33% | 34% | 34% | 30% | .049 |
| Cholesterol Status | | | | | <.001 |
| Normal | 41% | 40% | 40% | 45% | |
| Untreated-total cholesterol 5.18-6.18 mmol/L | 33% | 34% | 34% | 32% | |
| Untreated-total cholesterol >6.18 mmol/L | 19% | 21% | 18% | 18% | |
| Treated high cholesterol³ | 6.5% | 5.8% | 8.2% | 4.7% | |
| eGFR (Median, Q1-Q3) | 97 (86, 107) | 98 (86, 107) | 96 (86, 106) | 97 (85, 107) | .008 |
¹ Half of the eligible participants in the Atherosclerosis Risk in Communities (ARIC) study at Visit 2 (1990-1992).
² Participants were categorized as having diagnosed diabetes if they self-reported a physician diagnosis or used diabetes medications. Participants without diagnosed diabetes were classified as having undiagnosed diabetes (fasting glucose ≥126 mg/dL or non-fasting glucose ≥200 mg/dL), as at risk for diabetes (fasting glucose of 100 to <126 mg/dL or non-fasting glucose of 140 to <200 mg/dL), or as not diabetic/not at risk for diabetes.
³ Participants were categorized as having treated high cholesterol if they self-reported use of a lipid-lowering drug.
Smoking-related proteins
We selected 75, 64, and 4 proteins (116 unique) for their associations with smoking status (current versus never), packyears, and quitting within 3 years in former smokers, respectively (Table S1). Of these, 27 proteins overlapped between smoking status and packyears. The top 4 proteins by effect size (listed by gene name) were C1QL1, ALPPL2, B4GALT1, and CD93 for smoking status; WFDC2, MMP12, RSPO4, and SIGLEC7 for packyears; and KL, GPR68, SLPI, and NOTCH3 for quitting within 3 years in former smokers. Levels of 58 proteins were positively associated and 58 were inversely associated with self-reported smoking. The median smoking-related protein score was 54 (range: 27-101, 25th to 75th percentiles: 47-64). The maximum value for Q3 fell halfway through the observed distribution of the score. Of current smokers, 85% fell in the top quartile of the score, whereas only 14% of former and 2% of never smokers did. The bottom two quartiles of the score contained 76% of never smokers and 50% of former smokers, but only 4% of current smokers (Table 2).
Table 2.
Cross-tabulation of self-reported smoking status with packyear categories and smoking-related protein score quartiles, Set 2¹, ARIC, 1990-1992.
| | Never Smokers, N = 2,117 | Former Smokers, N = 1,986 | Current Smokers, N = 1,179 |
|---|---|---|---|
| Packyears categories | |||
| Never Smokers | 100% | - | - |
| >0 to 10 | - | 39% | 10% |
| >10 to 20 | - | 20% | 14% |
| >20 to 30 | - | 16% | 18% |
| >30 | - | 25% | 58% |
| Smoking-related protein score quartiles | |||
| Q1 | 41% | 23% | 1% |
| Q2 | 35% | 27% | 3% |
| Q3 | 22% | 36% | 11% |
| Q4 | 2% | 14% | 85% |
¹ Half of the eligible participants in the Atherosclerosis Risk in Communities (ARIC) study at Visit 2 (1990-1992).
Association between the score and smoking-related cancer incidence and mortality, all-cause mortality, and MI/fatal CHD
The aHRs of smoking-related cancer incidence and mortality comparing the highest to lowest quartile of the score were 3.89 (95% CI 3.06-4.96, P-trend<.001) and 5.73 (95% CI 4.08-8.06, P-trend<.001) before (Model 2), and 2.28 (95% CI 1.65-3.15; P-trend<.001) and 2.67 (95% CI 1.74-4.10, P-trend<.001) after (Model 4) adjusting for self-reported smoking information (Table 3). This pattern was the same in Models 1 and 3. After further adjusting for DNA methylation-predicted packyears (Model 5, 27% with methylation data), the score remained positively associated with smoking-related cancer incidence and mortality. In ever smokers, the score (continuous) explained 35% (34% in Set 1) and DNA-methylation-predicted packyears (continuous) explained 23% (19% in Set 1) of the variation in packyears. The adjusted cumulative incidence and mortality (Model 4) of smoking-related cancer by quartiles and age are shown in Figure S2.
Table 3.
Hazard ratios (HR) and 95% confidence intervals (CI) of the association between smoking-related protein score quartiles and incidence of and mortality from smoking-related cancers, Set 2¹, ARIC, 1990-1992 to 2015.
| Model | Quartile of smoking-related protein score | Incidence: Cases N/Person-Years | Incidence: HR | Incidence: 95% CI | Mortality: Cases N/Person-Years | Mortality: HR | Mortality: 95% CI |
|---|---|---|---|---|---|---|---|
| Model 1² | Q1 | 97/26795 | REF | | 44/29460 | REF | |
| | Q2 | 106/25852 | 1.06 | 0.80, 1.40 | 62/28396 | 1.35 | 0.91, 1.99 |
| | Q3 | 160/23981 | 1.65 | 1.28, 2.14 | 101/26848 | 2.14 | 1.49, 3.07 |
| | Q4 | 309/21291 | 3.93 | 3.10, 4.99 | 217/24024 | 5.95 | 4.26, 8.31 |
| Model 2² | Q1 | 97/26795 | REF | | 44/29460 | REF | |
| | Q2 | 106/25852 | 1.04 | 0.79, 1.38 | 62/28396 | 1.31 | 0.89, 1.94 |
| | Q3 | 159/23979 | 1.58 | 1.22, 2.06 | 101/26823 | 2.03 | 1.41, 2.92 |
| | Q4 | 309/21291 | 3.89 | 3.06, 4.96 | 217/24024 | 5.73 | 4.08, 8.06 |
| Model 3² | Q1 | 97/26795 | REF | | 44/29460 | REF | |
| | Q2 | 106/25852 | 1.03 | 0.78, 1.35 | 62/28396 | 1.27 | 0.86, 1.88 |
| | Q3 | 160/23981 | 1.45 | 1.11, 1.90 | 101/26848 | 1.72 | 1.19, 2.49 |
| | Q4 | 309/21291 | 2.31 | 1.67, 3.19 | 217/24024 | 2.69 | 1.76, 4.14 |
| Model 4² | Q1 | 97/26795 | REF | | 44/29460 | REF | |
| | Q2 | 106/25852 | 1.02 | 0.77, 1.34 | 62/28396 | 1.25 | 0.85, 1.85 |
| | Q3 | 159/23979 | 1.41 | 1.08, 1.84 | 101/26823 | 1.67 | 1.15, 2.43 |
| | Q4 | 309/21291 | 2.28³ | 1.65, 3.15 | 217/24024 | 2.67 | 1.74, 4.10 |
| Model 5² | Q1 | 27/7296 | REF | | 14/7897 | REF | |
| | Q2 | 33/7023 | 1.18 | 0.70, 1.97 | 16/7815 | 1.10 | 0.53, 2.26 |
| | Q3 | 39/6197 | 1.42 | 0.84, 2.37 | 29/6949 | 1.83 | 0.93, 3.59 |
| | Q4 | 75/5128 | 2.24 | 1.19, 4.22 | 56/5807 | 2.21 | 0.99, 4.94 |
¹ Half of the eligible participants in the Atherosclerosis Risk in Communities (ARIC) study at Visit 2 (1990-1992), randomly selected for the association analyses. The 13 smoking-associated cancers included: lung/bronchus, larynx, head and neck, esophagus, bladder, kidney, liver, acute myeloid leukemia, stomach, pancreas, colon, rectum, cervix. Median follow-up time is 23 years.
² Model 1: Score + demographic factors. Model 2: Model 1 + cancer risk factors (age, sex, race-study center, post-menopausal hormone use, alcohol drinking status, BMI categories, diabetes status, hypertension medication use, cholesterol status, eGFR). Model 3: Model 1 + self-reported smoking information (smoking status, packyears). Model 4: Model 2 + self-reported smoking information. Model 5: Model 4 + DNA methylation-predicted packyears (among 1438 participants with methylation data).
³ When adjusting for terms for former-quit ≤3 years and former-quit >3 years (both vs never smoker) instead of a term for former smoker: HR=2.27, 95% CI 1.63-3.15.
The aHRs of lung cancer incidence and mortality comparing the highest to lowest quartile of the score were 12.1 (95% CI 7.11-20.6) and 14.2 (95% CI 7.58-26.8) before (Model 2), and 3.04 (95% CI 1.59-5.81) and 4.12 (95% CI 1.99-8.53) after (Model 4) adjusting for self-reported smoking information (all P-trend<.001, Table 4). A similar pattern was observed for NSCLC (Table S2). The adjusted cumulative incidence and mortality (Model 4) of lung cancer by quartiles and age are shown in Figure S3. After adjusting for methylation-predicted packyears, the positive association for the score was not attenuated (Table 4). For smoking-associated cancers excluding lung, the score was positively associated with both incidence and mortality (Table S3). The associations for smoking-related and lung cancer incidence and mortality were present among ever, former, and current smokers, and were comparable in Black and White participants and in male and female participants (Table S4). The score was also positively associated with all-cause mortality and MI/fatal CHD before and after adjustment for self-reported smoking information (Table 5).
Table 4.
Hazard ratios (HR) and 95% confidence intervals (CI) of the association between smoking-related protein score quartiles and incidence of and mortality from lung cancer, Set 2¹, ARIC, 1990-1992 to 2015.
| Model | Quartile of smoking-related protein score | Incidence: Cases N/Person-Years | Incidence: HR | Incidence: 95% CI | Mortality: Cases N/Person-Years | Mortality: HR | Mortality: 95% CI |
|---|---|---|---|---|---|---|---|
| Model 1² | Q1 | 16/26795 | REF | | 11/29460 | REF | |
| | Q2 | 14/25852 | 0.86 | 0.42, 1.77 | 18/28396 | 1.57 | 0.74, 3.33 |
| | Q3 | 42/23981 | 2.74 | 1.53, 4.91 | 43/26848 | 3.68 | 1.88, 7.18 |
| | Q4 | 161/21291 | 13.1 | 7.72, 22.1 | 138/24024 | 15.4 | 8.23, 28.7 |
| Model 2² | Q1 | 16/26795 | REF | | 11/29460 | REF | |
| | Q2 | 14/25852 | 0.86 | 0.42, 1.77 | 18/28396 | 1.55 | 0.73, 3.29 |
| | Q3 | 42/23979 | 2.73 | 1.52, 4.90 | 43/26823 | 3.58 | 1.82, 7.01 |
| | Q4 | 161/21291 | 12.1 | 7.11, 20.6 | 138/24024 | 14.2 | 7.58, 26.8 |
| Model 3² | Q1 | 16/26795 | REF | | 11/29460 | REF | |
| | Q2 | 14/25852 | 0.72 | 0.35, 1.49 | 18/28396 | 1.34 | 0.63, 2.85 |
| | Q3 | 42/23981 | 1.55 | 0.85, 2.82 | 43/26848 | 2.21 | 1.11, 4.38 |
| | Q4 | 161/21291 | 3.06 | 1.61, 5.83 | 138/24024 | 4.08 | 1.98, 8.44 |
| Model 4² | Q1 | 16/26795 | REF | | 11/29460 | REF | |
| | Q2 | 14/25852 | 0.73 | 0.36, 1.50 | 18/28396 | 1.35 | 0.64, 2.87 |
| | Q3 | 42/23979 | 1.60 | 0.88, 2.93 | 43/26823 | 2.27 | 1.14, 4.50 |
| | Q4 | 161/21291 | 3.04³ | 1.59, 5.81 | 138/24024 | 4.12 | 1.99, 8.53 |
| Model 5² | Q1 | 4/7296 | REF | | 1/7897 | REF | |
| | Q2 | 2/7023 | 0.46 | 0.08, 2.51 | 3/7815 | 2.61 | 0.27, 25.3 |
| | Q3 | 12/6197 | 2.22 | 0.67, 7.42 | 12/6949 | 8.27 | 1.03, 66.1 |
| | Q4 | 40/5128 | 3.20 | 0.84, 12.2 | 38/5807 | 12.5 | 1.44, 109 |
¹ Half of the eligible participants in the Atherosclerosis Risk in Communities (ARIC) study at Visit 2 (1990-1992), randomly selected for the association analyses. Median follow-up time is 23 years.
² Model 1: Score + demographic factors. Model 2: Model 1 + cancer risk factors (age, sex, race-study center, post-menopausal hormone use, alcohol drinking status, BMI categories, diabetes status, hypertension medication use, cholesterol status, eGFR). Model 3: Model 1 + self-reported smoking information (smoking status, packyears). Model 4: Model 2 + self-reported smoking information. Model 5: Model 4 + DNA methylation-predicted packyears (among 1438 participants with methylation data).
³ When adjusting for terms for former-quit ≤3 years and former-quit >3 years (both vs never smoker) instead of a term for former smoker: HR=3.17, 95% CI 1.64-6.11.
Table 5.
Hazard ratios (HR) and 95% confidence intervals (CI) of the association between smoking-related protein score quartiles and incidence of myocardial infarction (MI) and/or fatal coronary heart disease (CHD) and all-cause mortality, Set 2¹, ARIC, 1990-1992 to 2015.
| Model | Quartile of smoking-related protein score | MI/fatal CHD: Cases N/Person-Years | MI/fatal CHD: HR | MI/fatal CHD: 95% CI | All-cause mortality: Cases N/Person-Years | All-cause mortality: HR | All-cause mortality: 95% CI |
|---|---|---|---|---|---|---|---|
| Model 2² | Q1 | 136/31300 | REF | | 503/32781 | REF | |
| | Q2 | 195/29695 | 1.29 | 1.03, 1.61 | 602/31412 | 1.17 | 1.04, 1.32 |
| | Q3 | 206/27693 | 1.25 | 1.00, 1.57 | 695/29576 | 1.33 | 1.18, 1.50 |
| | Q4 | 246/23286 | 2.21 | 1.77, 2.77 | 899/25724 | 2.65 | 2.36, 2.98 |
| Model 4² | Q1 | 136/31300 | REF | | 503/32781 | REF | |
| | Q2 | 195/29695 | 1.28 | 1.02, 1.60 | 602/31412 | 1.16 | 1.02, 1.30 |
| | Q3 | 206/27693 | 1.19 | 0.95, 1.50 | 695/29576 | 1.24 | 1.10, 1.40 |
| | Q4 | 246/23286 | 1.74 | 1.30, 2.33 | 899/25724 | 1.81 | 1.54, 2.12 |
¹ Half of the eligible participants in the Atherosclerosis Risk in Communities (ARIC) study at Visit 2 (1990-1992), randomly selected for the association analyses. Median follow-up time is 23 years.
² Model 2: Score + demographic factors + risk factors (age, sex, race-study center, post-menopausal hormone use, alcohol drinking status, BMI categories, diabetes status, hypertension medication use, cholesterol status, eGFR). Model 4: Model 2 + self-reported smoking information.
Given the residual association for the score after adjusting for self-reported smoking information, we estimated the score’s discriminative ability. The concordance statistic (C-statistic) ranged from 0.63-0.67 for smoking-related cancer and lung cancer incidence and mortality when only including cancer risk factors. After adding the score, the C-statistic increased to 0.70-0.82, which was comparable to when adding self-reported smoking information (0.70-0.84). When adding both the score and self-reported smoking information, either individually or as joint categories, the C-statistic had the same improvement (0.71-0.85; Table S5).
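For readers unfamiliar with the C-statistic for time-to-event data, the following Python sketch computes Harrell's concordance index, one common definition. It is an assumption that this estimator matches the one cited in the Methods; the text does not say which was used. The function name and inputs are hypothetical, and pairs with tied event times are simply skipped for brevity.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance index for right-censored data.

    times  : follow-up time for each participant
    events : 1 if the event was observed, 0 if censored
    risks  : model-predicted risk (higher means earlier expected event)

    A pair (i, j) is comparable when participant i had an observed event
    strictly before participant j's follow-up time. The pair is concordant
    when i also has the higher predicted risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                  # pairs are anchored on an observed event
        for j in range(n):
            if times[i] < times[j]:   # i failed before j was last observed
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking, which gives a scale for reading the ranges reported above.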
In the internal confirmation analysis recalculating the score using Visit 3 proteins (Table S6) and Visit 5 proteins (Table S7), we observed positive associations between the score and smoking-related cancer and lung cancer incidence and mortality, including after adjusting for self-reported smoking information, as for Visit 2.
Confirmation analysis
Compared with ARIC participants, EPIC subcohort non-cases were younger (median 52 vs 57 years), included a higher percentage of women (60% vs 55%), and were from Europe rather than the US (ARIC participants were Black and White). Prevalences of score quartiles in never and in former smokers were similar in EPIC and ARIC. In EPIC as in ARIC, the percentage of current smokers in the top score quartile was high, although it was lower in EPIC (Table S8). The HR of lung cancer comparing Q4 to Q1 was 9.47 (95% CI 6.82-13.15) before and 2.23 (95% CI 1.48-3.35) after adjusting for self-reported smoking information.
Discussion
In this prospective cohort study, we developed a score from plasma proteins related to detailed smoking history. This score was significantly, positively associated with incidence and mortality from smoking-related cancer, especially lung cancer, even after adjusting for self-reported smoking information and DNA methylation-predicted packyears. The score’s association was comparable by sex and race and remained among former and current smokers. We confirmed that the score is associated with smoking status and lung cancer risk, including after adjusting for self-reported smoking information, in an independent cohort. Thus, our hypothesis that the score provides relative risk information beyond self-reported smoking is supported. The score was positively associated with MI/fatal CHD and all-cause mortality in ARIC, confirming its broad applicability.
While prior publications have reported associations between measures of smoking history and individual plasma proteins measured by large-scale proteomics,6,7 none generated a score and investigated its association with smoking-related cancer. In our study, we identified proteins from ~5,000 that were individually associated with current smoking, packyears, and recent quitting. To reduce false discovery, we did not consider proteins that did not pass Bonferroni correction or were highly correlated. To ensure robustness and reproducibility of protein selection, the analysis was repeated multiple times and only proteins observed in a high percentage of iterations were retained. Relaxing the threshold resulted in inclusion of more proteins, but the resulting scores yielded similar associations with incidence and mortality. While we did not weight each selected protein for the size of its association with smoking history in our algorithm, our simple score captured considerable relative risk information, including beyond self-reported smoking information and after adjusting for methylation-predicted packyears. In our internal confirmation, in which we recalculated the score using proteins measured 3 and 21 years later, we observed similar patterns of association.
We confirmed the score’s association with lung cancer incidence in EPIC. In both cohorts, the association was notably strong before adjusting for self-reported smoking information and, while attenuated, remained positive and statistically significant after adjustment. The HRs were moderately smaller in EPIC (HR: 9.47, 2.23 before and after smoking adjustment) than in ARIC (HR: 12.1, 3.04), which may be explained by a difference in packyears among current smokers (ARIC: 58% had >30 packyears, EPIC: median=21 packyears).
Our approach to select proteins for the score was agnostic. In reviewing them, some score proteins were previously reported for their associations or diagnostic value for lung or other smoking-associated cancers, such as MMP12 and WFDC2. Prior animal and human studies have reported the relationship between MMP12 and cigarette smoking.21,22 MMP1223,24 and WFDC224 were also identified in proteomic studies for associations with lung cancer using Olink.
When each was added to the model containing cancer risk factors, we observed the same percentage increase in the C-statistic for the score and for self-reported smoking information, suggesting that they capture similar discriminative information for smoking-associated cancer and lung cancer. Thus, we expect that the score holds promise for use in the clinic, where smoking history is often recalled with imprecision25 or where the time during a healthcare visit may be too short to solicit a detailed smoking history. Inaccurate smoking history could result in misclassification of screening eligibility. In the setting of risk-based cancer screening, use of blood biomarkers of exposure that are strongly associated with cancer risk might mitigate the impact of unreliable smoking history.26 If our score or a further optimized version is confirmed, trials will be needed to test whether the tool implemented in risk stratification is efficacious. With respect to risk stratification to guide cancer screening, studies would be needed to test whether combining the score with current screening guidelines better classifies high-risk individuals for lung and other smoking-associated cancers and improves screening performance. With respect to risk stratification to guide precision prevention, studies would be needed to test whether the score aids in targeting these strategies to the highest risk individuals to maximize risk mitigation. Studies will also be needed to test whether the score helps to avoid targeting lower risk individuals, to minimize unnecessary procedures and treatments. Additionally, research is needed to determine whether this score is complementary to proteomic-23 and methylation-based27 predictors of lung cancer.
This study has many strengths, including its prospective design, inclusion of Black and White women and men, and high-quality assessment of incident cancer. Several aspects warrant discussion. While we cannot rule out chance, results were consistent in internal and external confirmation analyses. We did not attempt to optimize the score to improve its sensitivity and specificity for smoking history or cancer risk. Instead, we developed a simple score that would be easy to implement. The SomaScan® 5K panel may not include all relevant proteins. Work will be needed to assess the score’s within-person variation over time. Future investigations are needed to confirm the generalizability of the score’s associations in independent studies, in diverse settings, and using other proteomics platforms.
In conclusion, we developed a score derived from proteins related to detailed smoking history. This score was associated with incidence of and mortality from smoking-related cancer and lung cancer and provided relative risk information beyond self-reported smoking and DNA methylation-predicted packyears. We confirmed the score’s association with lung cancer in an independent, multi-country cohort. The score was also positively associated with MI/fatal CHD and all-cause mortality, suggesting that it captures smoking-related risk beyond cancer. Such a score may improve risk stratification for precision prevention and screening, especially in settings in which detailed smoking history cannot be obtained accurately or feasibly.
Supplementary Material
Acknowledgments
ARIC: The authors thank the staff and participants of the ARIC study for their important contributions. Cancer data were provided by the Maryland Cancer Registry, Center for Cancer Prevention and Control, Maryland Department of Health, with funding from the State of Maryland and the Maryland Cigarette Restitution Fund. The collection and availability of cancer registry data are also supported by the Cooperative Agreement NU58DP007114, funded by the Centers for Disease Control and Prevention. The content of this work is solely the responsibility of the authors and does not necessarily represent the official views of the Centers for Disease Control and Prevention, the National Institutes of Health, or the State of Maryland. A preliminary version of this work was presented at the American Association for Cancer Research Annual Meeting in 2022.
EPIC: The authors thank all study participants for their participation and all interviewers who participated in the fieldwork studies in each EPIC center. The authors also thank Bertrand Hemon at the International Agency for Research on Cancer (IARC) for his valuable work and technical support with the EPIC database. IARC disclaimer: Where authors are identified as personnel of the IARC/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the IARC/World Health Organization.
The funders had no role in the design of the study; the collection, analysis, or interpretation of the data; or the writing of the manuscript and decision to submit it for publication.
Funding
ARIC: The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services, under contract nos. 75N92022D00001, 75N92022D00002, 75N92022D00003, 75N92022D00004, 75N92022D00005. Studies on cancer in ARIC are also supported by the National Cancer Institute (U01 CA164975). ARIC methylation data (HM450) on Black and White participants with exome chip PCs are also supported by 5RC2HL102419 and R01NS087541. SomaLogic Inc. conducted the SomaScan® assays in exchange for use of ARIC data. This work was supported in part by NIH/NHLBI grant R01 HL134320, and the Maryland Cigarette Restitution Fund at Johns Hopkins University.
EPIC: The coordination of EPIC-Europe is financially supported by International Agency for Research on Cancer (IARC) and also by the Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London which has additional infrastructure support provided by the NIHR Imperial Biomedical Research Centre (BRC). The national cohorts are supported by Associazione Italiana per la Ricerca sul Cancro-AIRC-Italy, Italian Ministry of Health, Italian Ministry of University and Research (MUR), Compagnia di San Paolo (Italy); Dutch Ministry of Public Health, Welfare and Sports (VWS), the Netherlands Organisation for Health Research and Development (ZonMW), World Cancer Research Fund (WCRF), (The Netherlands); Instituto de Salud Carlos III (ISCIII), Regional Governments of Andalucía, Asturias, Basque Country, Murcia and Navarra, and the Catalan Institute of Oncology - ICO (Spain); Cancer Research UK (C864/A14136 to EPIC-Norfolk; C8221/A29017 to EPIC-Oxford), Medical Research Council (MR/N003284/1, MC-UU_12015/1 and MC_UU_00006/1 to EPIC-Norfolk; MR/Y013662/1 to EPIC-Oxford) (United Kingdom). Previous support has come from “Europe against Cancer” Programme of the European Commission (DG SANCO). The generation of the proteomic data was partly funded by the Michael J Fox Foundation (#008994 to Christina M. Lill and Elio Riboli), the Cure Alzheimer’s Fund (to Christina M. Lill and Lars Bertram), the ‘CReATe- Clinical Research in ALS and Related Disorders for Therapeutic Development’ Consortium (to Christina M. Lill and Lars Bertram), with additional grant support from the Heisenberg program of the Deutsche Forschungsgemeinschaft (DFG; LI 2654/4-1 to Christina M. Lill). SomaScan® data were generated under Master Research Agreement, 14th December 2021, between Imperial College London and SomaLogic Inc. SomaLogic were not involved in analyzing or interpreting the data; or in writing or submitting the manuscript for publication.
Footnotes
Conflict of Interest Statement
The authors disclose no conflicts of interest.
Data Availability Statement
ARIC: ARIC data, including the data used in this analysis, can be accessed by the following policy: https://sites.cscc.unc.edu/aric/sites/default/files/public/listings/ARIC%20data%20sharing.pdf. ARIC data also are available via BioLINCC (controlled access database). Further information is available from the corresponding author upon request.
EPIC: For information on how to submit an application for gaining access to EPIC data and/or biospecimens, please follow the instructions at https://login.research4life.org/tacsgr0epic_iarc_fr/access/index.php.
References
- 1. United States Public Health Service. Smoking and Health: Report of the Advisory Committee to the Surgeon General of the Public Health Service. U.S. Department of Health, Education, and Welfare, Public Health Service; 1964.
- 2. Centers for Disease Control and Prevention (US), National Center for Chronic Disease Prevention and Health Promotion (US), Office on Smoking and Health (US). How Tobacco Smoke Causes Disease: The Biology and Behavioral Basis for Smoking-Attributable Disease: A Report of the Surgeon General. Centers for Disease Control and Prevention (US); 2010. Accessed January 30, 2024. http://www.ncbi.nlm.nih.gov/books/NBK53017/
- 3. Patrick DL, Cheadle A, Thompson DC, Diehr P, Koepsell T, Kinne S. The validity of self-reported smoking: a review and meta-analysis. Am J Public Health. 1994;84(7):1086–1093. doi: 10.2105/AJPH.84.7.1086
- 4. DeCaprio AP. Biomarkers: Coming of Age for Environmental Health and Risk Assessment. Environ Sci Technol. 1997;31(7):1837–1848. doi: 10.1021/es960920a
- 5. Institute of Medicine (US) Committee to Assess the Science Base for Tobacco Harm Reduction. Clearing the Smoke: Assessing the Science Base for Tobacco Harm Reduction. (Stratton K, Shetty P, Wallace R, Bondurant S, eds.). National Academies Press (US); 2001. Accessed December 13, 2021. http://www.ncbi.nlm.nih.gov/books/NBK222375/
- 6. Williams SA, Kivimaki M, Langenberg C, et al. Plasma protein patterns as comprehensive indicators of health. Nat Med. 2019;25(12):1851–1857. doi: 10.1038/s41591-019-0665-2
- 7. Yuan S, Khodursky S, Geng J, et al. Circulating Protein Mediators Linking Genetically Predicted Smoking to Abdominal Aortic Aneurysm: A Genomic-Proteomic Analysis. Arteriosclerosis, Thrombosis, and Vascular Biology. 2025;45(9):1683–1692. doi: 10.1161/ATVBAHA.125.323057
- 8. Landy R, Gomez I, Caverly TJ, et al. Methods for Using Race and Ethnicity in Prediction Models for Lung Cancer Screening Eligibility. JAMA Network Open. 2023;6(9):e2331155. doi: 10.1001/jamanetworkopen.2023.31155
- 9. The National Lung Screening Trial Research Team. Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening. New England Journal of Medicine. 2011;365(5):395–409. doi: 10.1056/NEJMoa1102873
- 10. Jonas DE, Reuland DS, Reddy SM, et al. Screening for Lung Cancer With Low-Dose Computed Tomography: Updated Evidence Report and Systematic Review for the US Preventive Services Task Force. JAMA. 2021;325(10):971–987. doi: 10.1001/jama.2021.0377
- 11. THE ARIC INVESTIGATORS. The Atherosclerosis Risk In Communities (ARIC) Study: Design and Objectives. American Journal of Epidemiology. 1989;129(4):687–702. doi: 10.1093/oxfordjournals.aje.a115184
- 12. Joshu CE, Barber JR, Coresh J, et al. Enhancing the Infrastructure of the Atherosclerosis Risk in Communities (ARIC) Study for Cancer Epidemiology Research: ARIC Cancer. Cancer Epidemiol Biomarkers Prev. 2018;27(3):295–305. doi: 10.1158/1055-9965.EPI-17-0696
- 13. Riboli E, Kaaks R. The EPIC Project: rationale and study design. European Prospective Investigation into Cancer and Nutrition. Int J Epidemiol. 1997;26(suppl_1):S6. doi: 10.1093/ije/26.suppl_1.S6
- 14. Riboli E, Hunt KJ, Slimani N, et al. European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutrition. 2002;5(6b):1113–1124. doi: 10.1079/PHN2002394
- 15. Gold L, Ayers D, Bertino J, et al. Aptamer-Based Multiplexed Proteomic Technology for Biomarker Discovery. PLOS ONE. 2010;5(12):e15004. doi: 10.1371/journal.pone.0015004
- 16. Walker KA, Chen J, Zhang J, et al. Large-scale plasma proteomic analysis identifies proteins and pathways associated with dementia risk. Nat Aging. 2021;1(5):473–489. doi: 10.1038/s43587-021-00064-0
- 17. White AD, Folsom AR, Chambless LE, et al. Community surveillance of coronary heart disease in the Atherosclerosis Risk in Communities (ARIC) Study: Methods and initial two years’ experience. Journal of Clinical Epidemiology. 1996;49(2):223–233. doi: 10.1016/0895-4356(95)00041-0
- 18. Sugden K, Hannon EJ, Arseneault L, et al. Establishing a generalized polyepigenetic biomarker for tobacco smoking. Transl Psychiatry. 2019;9(1):1–12. doi: 10.1038/s41398-019-0430-9
- 19. Zhao N, Ruan M, Koestler DC, et al. Methylation-derived inflammatory measures and lung cancer risk and survival. Clin Epigenetics. 2021;13(1):222. doi: 10.1186/s13148-021-01214-2
- 20. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 1975;12(4):387–415. doi: 10.1016/0022-2496(75)90001-2
- 21. Churg A, Wang RD, Tai H, et al. Macrophage Metalloelastase Mediates Acute Cigarette Smoke–induced Inflammation via Tumor Necrosis Factor-α Release. Am J Respir Crit Care Med. 2003;167(8):1083–1089. doi: 10.1164/rccm.200212-1396OC
- 22. Lavigne MC, Eppihimer MJ. Cigarette smoke condensate induces MMP-12 gene expression in airway-like epithelia. Biochemical and Biophysical Research Communications. 2005;330(1):194–203. doi: 10.1016/j.bbrc.2005.02.144
- 23. The Lung Cancer Cohort Consortium (LC3). The blood proteome of imminent lung cancer diagnosis. Nat Commun. 2023;14:3042. doi: 10.1038/s41467-023-37979-8
- 24. Feng X, Wu WYY, Onwuka JU, et al. Lung cancer risk discrimination of prediagnostic proteomics measurements compared with existing prediction tools. J Natl Cancer Inst. 2023;115(9):1050–1059. doi: 10.1093/jnci/djad071
- 25. Kukhareva PV, Caverly TJ, Li H, et al. Inaccuracies in electronic health records smoking data and a potential approach to address resulting underestimation in determining lung cancer screening eligibility. Journal of the American Medical Informatics Association. 2022;29(5):779–788. doi: 10.1093/jamia/ocac020
- 26. Caverly TJ, Zhang X, Hayward RA, Zhu J, Waljee AK. Effects of Random Measurement Error on Lung Cancer Screening Decisions: A Retrospective Cohort-Based Microsimulation Study. Chest. 2021;159(2):853–861. doi: 10.1016/j.chest.2020.08.2112
- 27. Onwuka JU, Guida F, Langdon R, et al. Blood-based DNA methylation markers for lung cancer prediction. BMJ Oncol. 2024;3(1):e000334. doi: 10.1136/bmjonc-2024-000334
- 28. Candia J, Daya GN, Tanaka T, Ferrucci L, Walker KA. Assessment of variability in the plasma 7k SomaScan proteomics assay. Sci Rep. 2022;12(1):17147. doi: 10.1038/s41598-022-22116-0
- 29. Joshi A, Mayr M. In Aptamers They Trust: The Caveats of the SOMAscan Biomarker Discovery Platform from SomaLogic. Circulation. 2018;138(22):2482–2485. doi: 10.1161/CIRCULATIONAHA.118.036823
