Journal of the American Medical Informatics Association : JAMIA. 2013 Feb 9;20(4):652–658. doi: 10.1136/amiajnl-2012-001557

ICD-9 tobacco use codes are effective identifiers of smoking status

Laura K Wiley 1,2, Anushi Shah 2, Hua Xu 2, William S Bush 1,2
PMCID: PMC3721171  PMID: 23396545

Abstract

Objective

To evaluate the validity of, characterize the usage of, and propose potential research applications for International Classification of Diseases, Ninth Revision (ICD-9) tobacco codes in clinical populations.

Materials and methods

Using data on cancer cases and cancer-free controls from Vanderbilt's biorepository, BioVU, we evaluated the utility of ICD-9 tobacco use codes to identify ever-smokers in general and high smoking prevalence (lung cancer) clinic populations. We assessed potential biases in documentation, and performed temporal analysis relating transitions between smoking codes to smoking cessation attempts. We also examined the suitability of these codes for use in genetic association analyses.

Results

ICD-9 tobacco use codes can identify smokers in a general clinic population (specificity of 1, sensitivity of 0.32), and there is little evidence of documentation bias. Frequency of code transitions between ‘current’ and ‘former’ tobacco use was significantly correlated with initial success at smoking cessation (p<0.0001). Finally, code-based smoking status assignment is a comparable covariate to text-based smoking status for genetic association studies.

Discussion

Our results support the use of ICD-9 tobacco use codes for identifying smokers in a clinical population. Furthermore, with some limitations, these codes are suitable for adjustment of smoking status in genetic studies utilizing electronic health records.

Conclusions

Researchers should not be deterred by the unavailability of full-text records to determine smoking status if they have ICD-9 code histories.

Keywords: Electronic Health Records, Phenotype, International Classification of Diseases

Background and significance

Smoking is a well-recognized public health concern as it is a risk factor for multiple diseases such as cancer, asthma, chronic obstructive pulmonary disease, and heart disease.1 Because smoking alters the risk for disease, it is important to ascertain smoking status in epidemiological, clinical, and genetic studies.2 Traditional studies obtain this information through questionnaires or interviews; however, many institutions are using de-identified versions of electronic medical records (EMRs) for clinical and genetic research.3–6 Smoking status is often stored in narrative text in EMRs, making identification more difficult.7 8 While documentation of smoking status as structured data is one of the objectives of ‘Meaningful Use’,9 it is important to be able to extract smoking status from records that may not be updated due to death or change of treatment facility.

Natural language processing (NLP) has shown promise in extracting smoking status from the text of medical records, and was the focus of the 2008 i2b2 Challenge.7 8 10 Research has demonstrated that NLP can determine specific smoking patterns (current, past, never, etc).11–13 Nevertheless, non-informaticians may be unable to use these algorithms due to technical limitations or privacy concerns since NLP requires medical record text.

Another potential method for identifying smoking status from EMRs is to use billing codes. The International Classification of Diseases, Ninth Revision (ICD-9) codes are frequently used in phenotype extraction algorithms.14 There are two primary ICD-9 tobacco use codes: 305.1—Current tobacco use and V15.82—History of tobacco use. Although some studies have reported limited sensitivity of these codes,15 no formal analysis of the validity of these codes has been performed.16

In this three-part study we evaluated the validity of, characterized the usage of, and propose potential research applications for ICD-9 tobacco use codes in clinical populations. To validate these codes we measured their accuracy in a general clinic population and a high smoking prevalence population (individuals with lung cancer). Characterization analyses evaluated documentation biases by providers/clinics and evaluated the consistency and appropriateness of assignment based on the temporal aspects of the codes. Finally, we explored the potential utility of these codes for genetic analyses by evaluating their suitability as controls for smoking status in a genetic association study of lung cancer.

Materials and methods

Study cohort

Given that smoking increases the risk of cancer, we analyzed a cohort originally ascertained to examine genetic associations with different forms of the following cancers: breast, colorectal, endometrial, lung, melanoma, non-Hodgkin's lymphoma, ovarian, and prostate. This cohort of 17 514 individuals consists of 7050 cancer cases and 10 464 cancer-free controls collected from BioVU, the Vanderbilt DNA Biobank.4 To be included as a cancer case, a valid tumor registry entry (the Tennessee Cancer Registry) was required. For breast, prostate, and melanoma cancer, individuals were also considered cases if there were three instances of an appropriate ICD-9 code (174.*, 185.*, and 172.*, respectively). Age and sex matched controls were required to have two or more clinical narratives (clinic note, discharge summary, etc) in their electronic record. Colorectal cancer controls with at least one normal colonoscopy were preferentially selected. Control individuals were excluded if they had any ICD-9 codes between 140.* and 239.* (the general range for neoplasms) or an entry in the tumor registry. Additionally, individuals were excluded if their problem list contained any words that could indicate any form of cancer.
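The control exclusion rule on ICD-9 codes can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function names and the example code histories are hypothetical, and the tumor registry and problem list checks are reduced to a single boolean flag.

```python
def is_neoplasm_code(code):
    """True if an ICD-9 code falls in the 140.* to 239.* neoplasm range."""
    category = code.split(".")[0]  # "174.9" -> "174", "V15.82" -> "V15"
    return category.isdigit() and 140 <= int(category) <= 239


def exclude_control(codes, in_tumor_registry=False):
    """Hypothetical helper: a candidate control is excluded if any neoplasm
    code is present or if the individual has a tumor registry entry."""
    return in_tumor_registry or any(is_neoplasm_code(c) for c in codes)


# Hypothetical code histories for illustration only.
print(exclude_control(["401.9", "305.1"]))   # no neoplasm code
print(exclude_control(["401.9", "172.5"]))   # melanoma code in range
```

V codes such as V15.82 are treated as outside the range because their category is not numeric.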

Smoking status definitions

This study compares manually reviewed smoking status to two automated definitions of ever/never-smokers: ICD and NLP. ‘ICD ever-smokers’ had one or more ICD-9 tobacco use disorder (305.1) or history of tobacco use (V15.82) codes. ‘ICD never-smokers’ had no record of either of these codes. NLP smoking status was based on an implementation of the cTAKES smoking algorithm17 at Vanderbilt.18 To create the gold standard, a single reviewer read the full text of selected medical records to identify smoking status. Individuals were classified as ever-smokers if there was any mention of smoking in the record, regardless of duration, quantity, or time since cessation. Never-smokers were required to have explicit documentation of smoking status as none, never, or life-time non/never-smoker in their record.
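The ICD definition above amounts to a membership test on an individual's code history. A minimal sketch, assuming codes are available as strings (the function name and the example histories are hypothetical):

```python
# The two ICD-9 tobacco use codes named in the text.
TOBACCO_CODES = {"305.1", "V15.82"}


def icd_smoking_status(codes):
    """Return 'ever' if any tobacco use code is present, otherwise 'never'."""
    return "ever" if TOBACCO_CODES.intersection(codes) else "never"


# Hypothetical code histories for illustration only.
print(icd_smoking_status(["401.9", "305.1", "V15.82"]))
print(icd_smoking_status(["401.9", "250.00"]))
```

Note that 'never' here only means that no tobacco use code was recorded, which is why the definition is validated against manual review.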

Assess validity of codes within clinical populations

To determine the validity of ICD-9 tobacco use codes within a general clinic population, we examined 100 gold standard ever-smokers and 100 gold standard never-smokers from our control group of individuals without cancer. This sample size provided 85% power assuming an α of 0.05. We used this population because it is representative of the general clinic population at Vanderbilt. We categorized each individual as an ‘ICD ever/never-smoker’ and an ‘NLP ever/never-smoker’ and calculated the sensitivity, specificity, and accuracy of each definition. We also combined the two definitions so that ‘ever-smokers’ consisted of those labeled as smokers by either the ICD or NLP method (or both).

Next we investigated whether there was sufficient power to detect an association between smoking and lung cancer risk given the previously reported low sensitivity of the codes.15 We performed logistic regression on ICD ever/never-smokers for 731 lung cancer cases and 9003 controls adjusting for age and gender. Additionally, we expected that smoking status would be documented more frequently in lung cancer cases as smoking is a significant risk factor.1 As a result, we hypothesized that a lack of smoking codes for individuals with lung cancer could indicate never-smoking status. To evaluate this hypothesis, a single reviewer examined the records of all 288 ‘ICD never-smokers’ in our lung cancer population and classified individuals with the gold standard definition. Eighty of these records were reviewed a second time to confirm the accuracy of the classification. We then calculated the false negative rate (FNR) of the ICD definition (ie, the proportion of ‘ICD never-smokers’ who were gold standard ever-smokers). Individuals were excluded from this analysis if smoking status could not be identified by manual review. Given that the risk for lung cancer diminishes as a function of time since smoking cessation,19 we also hypothesized that healthcare providers may limit documentation of remote smoking. Thus we also considered a ‘relative risk’ definition where individuals who quit smoking prior to 1990 were considered non-smokers. This ‘cut point’ corresponds to at least 15 years since cessation since all individuals had one or more visits during or after 2007. We then compared the ICD smoking definition to the ‘relative risk’ gold standard with an FNR calculation. Former smokers whose quit date could not be determined from the record were excluded from the second analysis only.

Characterize usage of codes

To understand potential documentation biases, we examined which types of providers/clinics tend to document smoking status with ICD-9 codes. Unfortunately, in our research version of the EMR, it is not possible to identify providers or clinics that assign particular ICD-9 codes. However, we used contemporaneously assigned ICD-9 and Current Procedural Terminology (CPT) codes as surrogate variables for the type of provider. Thus, we extracted all ICD-9 and CPT codes for the 3752 ‘ICD ever-smokers’ in the entire study population. The codes were stratified based on co-documentation with tobacco use codes (ie, whether the code was assigned on the same date as the tobacco use code). We used the frequency of individuals with and without co-documentation to measure the difference in reporting of tobacco use.
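One way to read the co-documentation measure is sketched below: for each non-tobacco code, count individuals who had it recorded on the same date as a tobacco use code at least once, and individuals who only had it on other dates, then take the difference. The per-individual counting rule, the function name, and the example records are assumptions made for illustration; the text does not spell out these details.

```python
from collections import defaultdict

TOBACCO_CODES = {"305.1", "V15.82"}


def codocumentation_difference(records):
    """records maps a person id to a list of (date, code) pairs.

    Returns, per non-tobacco code, the number of individuals with same-date
    co-documentation minus the number with different-date documentation only.
    """
    same = defaultdict(int)
    different = defaultdict(int)
    for events in records.values():
        tobacco_dates = {d for d, c in events if c in TOBACCO_CODES}
        other_codes = {c for _, c in events if c not in TOBACCO_CODES}
        co_documented = {c for d, c in events
                         if c not in TOBACCO_CODES and d in tobacco_dates}
        for code in other_codes:
            if code in co_documented:
                same[code] += 1
            else:
                different[code] += 1
    return {c: same[c] - different[c] for c in set(same) | set(different)}


# Hypothetical records for three 'ICD ever-smokers'.
records = {
    "p1": [("2009-01-05", "305.1"), ("2009-01-05", "60.5"), ("2009-03-01", "99213")],
    "p2": [("2010-02-02", "V15.82"), ("2010-02-02", "60.5"), ("2010-05-05", "99213")],
    "p3": [("2011-07-07", "305.1"), ("2011-08-08", "99213")],
}
print(codocumentation_difference(records))
```

A positive value means the code is more often recorded together with a tobacco use code than apart from it.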

Interestingly, the two ICD-9 tobacco use codes are divided into current and past use. We wanted to identify whether providers used the two codes interchangeably and whether temporal arrangements of these codes modeled actual behavior patterns (as recorded in the medical record). In order to capture individuals who had multiple smoking cessation attempts, we required individuals to have at least two current tobacco use ICD-9 codes (305.1) and at least two ‘history of tobacco use’ ICD-9 codes (V15.82) for this analysis. We ordered each individual's smoking codes by date and then collapsed identical codes to create a summary of chronological code transitions (ie, switching from 305.1 to V15.82 and vice versa over time). The number of transitions for each subject was tallied and individuals were grouped based on whether they had one transition or more than one transition. Following manual review of the medical records, individuals were grouped into four categories: always smokers (ie, no notation of trying to quit), successful quitters (ie, smoked then quit and never resumed), unsuccessful quitters (eg, smoked, quit, and resumed in any pattern and frequency), and those whom we were unable to classify. Individuals in the last category were removed from further analysis. We then performed a 2×3 χ2 analysis comparing the two types of transitions (single vs multiple) and actual smoking status (always, single successful quit attempt, unsuccessful quit attempt/s).
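The transition summary described above (order by date, collapse runs of identical codes, count the switches) can be sketched as follows. The function names and the example history are hypothetical, and the data layout of dated (date, code) pairs is an assumption.

```python
from datetime import date
from itertools import groupby


def is_eligible(dated_codes):
    """At least two 305.1 codes and at least two V15.82 codes."""
    codes = [code for _, code in dated_codes]
    return codes.count("305.1") >= 2 and codes.count("V15.82") >= 2


def code_transitions(dated_codes):
    """Return the collapsed chronological code sequence and its transition count."""
    ordered = [code for _, code in sorted(dated_codes)]
    collapsed = [code for code, _ in groupby(ordered)]  # drop consecutive repeats
    return collapsed, max(len(collapsed) - 1, 0)


# Hypothetical history: current use, then history of use, then current use again.
history = [
    (date(2009, 1, 5), "V15.82"),
    (date(2008, 3, 1), "305.1"),
    (date(2010, 2, 2), "305.1"),
    (date(2008, 9, 12), "305.1"),
    (date(2009, 6, 20), "V15.82"),
]
print(is_eligible(history))
print(code_transitions(history))
```

An individual with one transition would fall in the single-transition group; the example, with two, would fall in the multiple-transition group.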

Investigate research applications of codes

Finally, given the increased interest in using EMR-linked biobanks for genetic research,5 6 we investigated the suitability of ICD identified smoking status as a covariate in genetic analysis. We anticipated that the low ICD-9 documentation of smoking status would reduce our power to detect the effects of single nucleotide polymorphisms (SNPs). This reduction in power would likely reduce the accuracy of effect estimates for SNP analysis. To test these hypotheses, we performed logistic regression on 12 SNPs for 731 lung cancer cases and 9003 controls. There are well established relationships between SNPs that alter smoking behavior and lung cancer,20 so controlling for smoking status is especially important for these analyses. The SNPs examined were previously reported in genome-wide association studies as associated with lung cancer. Our regression model controlled for gender, age, the top three principal components (reflecting genetic ancestry), and smoking status. We evaluated three models for each SNP, one using the NLP-defined smoking status, one using the ICD-defined smoking status (described earlier), and one unadjusted for smoking status. For comparisons, regression models using the NLP-defined smoking status are considered our ideal model. For the results of the regression models we used a significance threshold of p<0.004 (the Bonferroni correction for 12 independent tests). Given that many of the associations may be robust against the effect of smoking, we examined all SNPs where there was non-concordance between the three models at a nominal significance level (p<0.05). As before, we used the NLP-adjusted model as the ideal and determined how the ICD-adjusted model compared.
To assess how smoking definition alters statistical power to detect the effects of lung cancer SNPs, we performed post hoc power calculation using CaTS V.0.0.2.21 We assumed a lung cancer prevalence of 10% based on available SEER data from 2009,22 and ORs from each regression model were used as point estimates of the genotype effect size. Across all SNPs, we calculated the average change in statistical power between the models adjusted using the ICD or NLP methods. The end result of this power calculation is the expected change in power to detect lung cancer-associated SNPs when using ICD-based smoking status versus NLP-based smoking status.

Results

A breakdown of the cohort by cancer and control group type as well as ICD smoking status is presented in table 1. Additionally table 1 gives counts of individuals with multiple code transitions (two or more 305.1 codes and two or more V15.82 codes) and the average number of ICD-9 tobacco use codes for these individuals. The maximum number of codes any single individual in the transition group had was 43 codes (from the breast cancer control cohort).

Table 1.

Cohort summary (n=17 514)

Group                              ICD ever-smokers,* n (%)   Individuals with multiple codes,† n (average number of codes)
Cases
 Breast cancer (n=1377)            217 (15.8%)                3 (7)
 Colorectal cancer (n=912)         270 (29.6%)                4 (7)
 Endometrial cancer (n=238)        33 (13.9%)                 1 (15)
 Lung cancer (n=924)               588 (63.6%)                50 (11)
 Melanoma (n=1348)                 275 (20.4%)                4 (5)
 Non-Hodgkin's lymphoma (n=348)    68 (19.5%)                 5 (7)
 Ovarian cancer (n=186)            46 (24.7%)                 2 (4)
 Prostate cancer (n=2145)          569 (26.5%)                10 (10)
Controls
 Breast cancer (n=2753)            358 (13.0%)                9 (13)
 Colorectal cancer (n=1822)        333 (18.3%)                7 (12)
 Melanoma (n=2630)                 472 (17.9%)                2 (6)
 Ovarian cancer (n=342)            51 (14.9%)                 2 (5)
 Prostate cancer (n=4288)          931 (21.7%)                17 (11)

*One or more 305.1 or V15.82 ICD-9 codes.

†Two or more 305.1 ICD-9 codes and two or more V15.82 ICD-9 codes.

ICD-9, International Classification of Diseases, Ninth Revision.

Assess validity of codes within clinical populations

Performance metrics of the ICD and NLP definitions of ever-smokers on the sampled individuals are provided in table 2.

Table 2.

Performance of ICD, NLP, and combined definitions of ever-smokers from group of ever-smokers (n=100) and never-smokers (n=100)

Definition   Sensitivity (95% CI)   Specificity (95% CI)   Accuracy (95% CI)
ICD only     0.32 (0.23 to 0.41)    1                      0.66 (0.59 to 0.73)
NLP only     0.78 (0.70 to 0.86)    0.88 (0.82 to 0.94)    0.83 (0.78 to 0.88)
ICD+NLP*     0.82 (0.75 to 0.90)    1                      0.91 (0.87 to 0.95)

*Ever-smokers if either ICD or NLP (or both) classify as ever-smoker.

ICD, International Classification of Diseases; NLP, natural language processing.
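As a worked check of the ‘ICD only’ row in table 2, the sketch below recomputes the three metrics from the counts implied by the reported values (32 of 100 ever-smokers flagged, none of 100 never-smokers flagged). The normal-approximation (Wald) interval is an assumption, since the text does not name the interval method, although it appears to reproduce the reported bounds.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (assumed method)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Counts implied by table 2 for the ICD definition.
tp, fn = 32, 68    # gold standard ever-smokers with / without a tobacco code
tn, fp = 100, 0    # gold standard never-smokers without / with a tobacco code

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(round(sensitivity, 2), [round(x, 2) for x in wald_ci(sensitivity, tp + fn)])
print(round(specificity, 2))
print(round(accuracy, 2), [round(x, 2) for x in wald_ci(accuracy, tp + fn + tn + fp)])
```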

The logistic regression of ICD-defined smokers on lung cancer incidence showed a significant smoking effect with an OR (95% CI) of 9.69 (8.21 to 11.43). Results of the FNR analysis in individuals with lung cancer are presented in table 3. Of the 288 lung cancer cases without an ICD-9 code for tobacco use, we were unable to identify the smoking status of five individuals, who were excluded from both analyses. We calculated the FNR using 86 never-smokers and 197 ever-smokers. In the second analysis, eight individuals had quit smoking, but a quit date could not be determined, so they were excluded from the second calculation only. The low risk category consisted of 149 individuals (86 never-smokers and 63 ever-smokers with >15 years since smoking cessation). The high risk category consisted of 126 individuals (108 who quit after 1990 and 18 whose records indicate continuous smoking).

Table 3.

False negative rate (FNR) of International Classification of Diseases definition of never-smokers in a lung cancer case cohort (n=288)

Definition                          Count (n)   FNR (95% CI)
Gold standard                                   0.70 (0.64 to 0.75)
 Ever-smokers                       197
 Never-smokers                      86
 Unable to classify                 5
Relative risk                                   0.46 (0.40 to 0.52)
 Ongoing smoking                    18
 Quit smoking during/after 1990     108
 Quit smoking prior to 1990         63
 Unable to assign quit date         8
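The two FNR values in table 3 can be recovered from the counts. In this sketch each FNR is taken as the share of classifiable ‘ICD never-smokers’ who were ever-smokers (gold standard) or high risk (relative risk definition), which is the reading that matches the reported figures; the variable names are illustrative.

```python
# Counts from table 3 (lung cancer cases without a tobacco use code, n=288).
ever_smokers = 197
never_smokers = 86
ongoing = 18
quit_1990_or_later = 108
quit_before_1990 = 63

# Gold standard: 5 unclassifiable records are excluded (288 - 5 = 283).
gold_denominator = ever_smokers + never_smokers
fnr_gold = ever_smokers / gold_denominator

# Relative risk: high risk = ongoing smokers plus those who quit in or after 1990;
# low risk = never-smokers plus those who quit before 1990.
# The 8 former smokers without a quit date are excluded (283 - 8 = 275).
high_risk = ongoing + quit_1990_or_later
low_risk = never_smokers + quit_before_1990
fnr_relative = high_risk / (high_risk + low_risk)

print(gold_denominator, round(fnr_gold, 2))
print(high_risk + low_risk, round(fnr_relative, 2))
```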

Characterize usage of codes

Figures 1 and 2 describe the relative frequency of ICD-9 and CPT codes documented concurrently (ie, on the same date) with the smoking ICD-9 codes. ICD-9 codes are broken down by general code type in figure 1. The ICD-9 codes most frequently co-documented with tobacco use ICD-9 codes are: 17.42 (laparoscopic robotic assisted procedure), 40.3 (regional lymph node excision), 40.53 (radical excision of iliac lymph nodes), 60.5 (radical prostatectomy), and V58.66 (long-term [current] use of aspirin). The three codes least frequently documented with smoking are: 780.79 (other malaise and fatigue), V67.09 (following other surgery), and V72.83 (other specified preoperative examination). CPT codes are plotted in figure 2. The most frequently co-documented codes are: 38571 (laparoscopic surgery with bilateral total pelvic lymphadenectomy), 55866 (laparoscopic prostatectomy), and 88309 (level VI surgical pathology gross and microscopic examination). The least frequently co-documented codes are: 36415 (collection of venous blood by venipuncture), 99024 (postoperative follow-up visit), and 99213 (office outpatient visit).

Figure 1.

Frequency of International Classification of Diseases, Ninth Revision (ICD-9) co-documentation with tobacco use codes. Each point represents a single ICD-9 code. Points to the right of the equal documentation line (indicated in white) have an increased frequency of co-documentation with an ICD-9 tobacco use code (eg, 300 more individuals had their code documented on the same day than those whose code was documented on different days). Points to the left of the center line have a reduced frequency of co-documentation with tobacco use codes.

Figure 2.

Frequency of Current Procedural Terminology (CPT) co-documentation with tobacco use codes. Each point represents a single CPT code. Points to the right of the equal documentation line (indicated in white) have an increased frequency of co-documentation with an International Classification of Diseases, Ninth Revision tobacco use code (eg, 300 more individuals had their code documented on the same day than those whose code was documented on different days). Points to the left of the center line have a reduced frequency of co-documentation with tobacco use codes.

Of the 116 eligible individuals, six were excluded from the analysis because their smoking pattern could not be determined after manual review of their medical records. A contingency table comparing the number of code transitions (single or multiple) to behavior patterns (always smoker, single successful quit attempt, and one or more unsuccessful quit attempts) for the remaining 110 individuals is presented in table 4. The mean numbers of smoking codes for single and multiple code transitions are 11 and 10 codes per individual, respectively.

Table 4.

Contingency table comparing longitudinal smoking pattern and relative number of ICD-9 code transitions (n=110)

                            Continuous smoking   Single successful quit attempt   ≥1 Unsuccessful quit attempts
Single code transition      2                    25                               21
Multiple code transitions   8                    9                                45

χ2 = 18.3725; df=2; p<0.0001.

ICD-9, International Classification of Diseases, Ninth Revision.
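The χ2 statistic reported under table 4 can be reproduced from the six cell counts with the standard Pearson formula. The sketch below uses only the standard library; for 2 degrees of freedom the upper-tail probability has the closed form exp(-χ2/2), which gives a p value on the order of 10^-4.

```python
import math

# Rows: single vs multiple code transitions.
# Columns: continuous smoking, single successful quit, unsuccessful quit attempt(s).
observed = [
    [2, 25, 21],
    [8, 9, 45],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.exp(-chi2 / 2)  # exact chi-square upper tail when df == 2

print(n, df, round(chi2, 4))
print(p_value)
```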

Investigate research applications of codes

Results from the logistic regression of 12 SNPs on lung cancer are presented in figure 3. Each SNP has three regression results: one unadjusted for smoking status, one adjusted for NLP-derived smoking status, and the other adjusted using ICD-defined smokers. All SNP models—even those that are not significant—have the same direction of effect and overlapping 95% CIs. Two SNP associations had non-concordant significance levels between the three different models (at a nominal significance cut-off of p<0.05). rs402710 was listed as nominally significant in both the unadjusted and ICD-adjusted models (p=0.027 and p=0.044, respectively). The NLP-adjusted model for this SNP was not significant (p=0.065). rs7626795 was nominally significant in the unadjusted model (p=0.047) and not significant in either the NLP or ICD-adjusted models (p=0.15 and p=0.17, respectively). Post hoc power calculations using these ORs averaged a 0.8% decrease in power (SD 4.6%) for models using ICD-defined smoking status.
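The two significance thresholds and the concordance check can be illustrated with the p values quoted above. This is a small sketch with illustrative names; only the two SNPs whose p values are given in the text are included.

```python
NOMINAL_ALPHA = 0.05
N_TESTS = 12
bonferroni_alpha = NOMINAL_ALPHA / N_TESTS  # about 0.0042, reported as p<0.004

# p values quoted in the text for the two non-concordant SNPs.
p_values = {
    "rs402710": {"unadjusted": 0.027, "ICD": 0.044, "NLP": 0.065},
    "rs7626795": {"unadjusted": 0.047, "ICD": 0.17, "NLP": 0.15},
}


def significant_models(model_p_values, threshold):
    """Names of the models whose p value falls below the threshold."""
    return {name for name, p in model_p_values.items() if p < threshold}


for snp, model_p_values in p_values.items():
    flagged = significant_models(model_p_values, NOMINAL_ALPHA)
    concordant = len(flagged) in (0, len(model_p_values))
    print(snp, sorted(flagged), "concordant" if concordant else "non-concordant")
```

Neither SNP reaches the Bonferroni threshold in any of the three models, so the non-concordance only appears at the nominal level.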

Figure 3.

Effects of ICD-based tobacco use covariates on genetic associations. Single nucleotide polymorphism (SNP) ORs estimated by logistic regression for unadjusted (black) and adjusted for natural language processing (blue) or ICD (green) defined smoking status are plotted with their 95% CIs. Significant SNP models (Bonferroni corrected, p<0.004) are indicated in bold. Published ORs for the significant SNPs are represented by the red points. ICD, International Classification of Diseases; NLP, natural language processing.

Discussion

Previous studies have suggested that tobacco use ICD-9 codes are an unreliable method of determining smoking status. In this study we sought to validate the accuracy of these codes and investigate potential uses of these data for future research studies. Our study supports the notion that tobacco use codes are not particularly sensitive within a general (without cancer) clinic population, although our study showed a higher sensitivity than previously reported.15 This could be due to sampling differences in clinic populations as Vanderbilt Medical Center is a tertiary care facility treating patients from mid Tennessee and Kentucky—areas with a high prevalence of smoking. An important feature of detection methods is preventing false positives. The specificity of the tobacco use ICD-9 codes was perfect, indicating the exceptional utility of these codes for identifying true smokers. Indeed, none of the ∼150 records with a tobacco use ICD-9 code examined was an identifiable false positive (five records had no explicit documentation of tobacco use). This high specificity supports the use of these codes for the identification of smokers for clinical studies, making participant selection easier and increasing patient privacy as this method does not require the full text of the medical record. While the significance and interpretation of the results of the NLP methods alone are discussed elsewhere,18 our analysis demonstrates that combining ICD-9 codes with the NLP method increased the sensitivity over the NLP method alone.

It is well known that smoking is a significant risk factor for lung cancer. As a measure of the efficacy of tobacco codes to identify smokers, we replicated the association of smoking with lung cancer using ICD-identified smokers. Published ORs for the effect of smoking on lung cancer vary widely based on gender, age, duration, and amount of smoking as well as lung cancer histologic type,23 with values ranging from 10 to 30-fold increased risk.1 The OR from our logistic regression model was on the low end of these estimates, an expected result given the granularity of our tobacco exposure measure as ever/never-smokers rather than the more specific pack-years measure. Additionally, we found that women had a slightly higher OR than men when we stratified our analysis by gender (data not shown). One potential confounding element in this analysis is the increased documentation rates of smokers with lung cancer as compared to populations without cancer. We did not specifically measure the sensitivity of ICD tobacco use codes in our lung cancer population, and we made the assumption that codes would continue to have perfect specificity as in populations without cancer and only examined individuals without tobacco use ICD-9 codes. Our hypothesis was that if documentation increased in this population, it was possible that a lack of documentation was informative of never-smoking status. Unfortunately, although tobacco use has a great effect on recurrence risk and treatment protocols for lung cancer, our data do not support that the lack of a tobacco use ICD-9 code is reflective of actual never-smoking status. Interestingly, the rate of false negatives declined when we considered relative risk from tobacco use. It is well known that the risk of cancer decreases with increased time since cessation of smoking. It is possible that as the risk decreased so did the impetus to document smoking status with ICD-9 codes. 
To support this hypothesis, the time since cessation for former smokers who did receive an ICD-9 code would need to be examined. Still, our data do not support using the lack of a smoking code as an indicator of never-smoking or remote smoking history.

A potential concern in using these codes for study ascertainment is the induction of an unknown bias towards certain clinical populations. For example, if tobacco use ICD-9 codes were preferentially documented in a cancer clinic, studies ascertaining smokers using the codes would be enriched for cancer patients, potentially biasing study results. Our study is limited because we cannot link the ICD-9 codes to any particular clinic or physician. This effect may be compounded by billing practices at Vanderbilt. Typically in outpatient clinics the physician assigns ICD-9 codes for visits, whereas patients admitted to the hospital are assigned codes by a team of professional billers in consultation with the physician and the medical record. With this caveat, our results demonstrated a slight increase in co-documentation of tobacco codes with surgical procedures. This could reflect the increased risk of surgery in smoking populations or justification for billing for additional services. Performing this analysis across the clinical EMR (where clinic and provider level data are available) may provide better resolution of potential biases. Additionally, future analyses should be performed at a different medical center to assess the generalizability of these trends.

The two tobacco ICD-9 codes have a temporal element reflecting current use (305.1) and past use (V15.82). However, it was unknown whether these codes are used interchangeably. Transitions between these codes showed remarkable similarity to behavioral changes recorded in the EMR, indicating that the temporal component of the code appears to be used appropriately. Discrepancies between assigned ICD-9 codes and actual smoking behavior were often due to either persistent duplication of a single smoking misclassification, or a failure to document updated smoking status in patient problem lists. This error may be inflated by in-patient professional medical coders using the problem list for billing rather than physician notes. Interestingly, the ‘history of tobacco use’ code was often documented when heavy smokers reduced cigarette consumption, perhaps reflecting the physician anticipating complete cessation. Additionally, given the significant interest in identifying factors affecting smoking cessation, we evaluated the potential of code transitions to model smoking behavior as single successful quit attempts or one or more failed quit attempts. Extracting even this granular classification of behavior from a clinical record would be difficult given the complexity of identifying sequences of events with NLP. Our results support the hypothesis that multiple code transitions are associated with one or more failed quit attempts. Thus researchers could use the frequency of transitions between ICD-9 tobacco use codes to identify individuals who have had difficulty with smoking cessation. An important caveat of this analysis is that we are matching to smoking status as recorded in the medical record. Given that the accuracy and completeness of smoking status in medical records is variable, it is unclear how code transitions relate to actual behavioral patterns.
Excitingly, however, this work serves as a proof of concept that temporal transitions in codes can be used to accurately model transitions in the medical record. Further research is needed to determine how this temporal modeling works on other types of ICD-9 codes, but our data suggest that using transitions among ICD-9 codes may be useful in phenotyping algorithms.

Finally, as more genetic studies use phenotypes derived from EMRs, we investigated the suitability of these codes for use as a covariate in genetic models. We performed association analyses with lung cancer, as controlling for smoking status is crucial in these models. In data to be presented elsewhere, we have shown that the regression models using NLP-derived smoking status as a covariate provide similar results to traditionally collected genetic epidemiology studies. When this ideal model is compared to models using the ICD-defined smokers for adjusted and unadjusted models, many of the SNP associations appear robust to the effect of smoking (as evidenced by no difference between unadjusted and adjusted models). While the similarity of ORs and significance levels between NLP and ICD-adjusted models is encouraging, our data are insufficient to definitively declare the equivalence of these methods. To better understand some of the potential biases of the two correction models, we reduced the significance threshold to nominal significance (p<0.05) and examined associations where the models were not uniformly concordant. One association (rs402710) was identified as not significant in the NLP-adjusted model and significant in the unadjusted and ICD-adjusted models. It is unclear whether this result is an effect of reduced power to adjust smoking by the ICD-based method or a feature of the underlying SNP association. While the original report of this association showed only slight attenuation of the signal based on tobacco use status,24 another SNP in moderate linkage disequilibrium (rs31489, r2 = 0.687) was highly affected by smoking status as the association was only significant in current smokers.25 The second non-concordant association was for rs7626795, where both smoking adjusted models eliminated the nominal significance of the association from the unadjusted model.
Interestingly, this association failed to replicate in the second stage of the reporting study, supporting our lack of replication under both adjusted models.26 Unfortunately, these data are inconclusive for evaluation of correction of SNP associations for smoking status. Ideally we would test these adjustment methods on SNPs that are highly sensitive to tobacco use status. Nevertheless, the totality of evidence presented in this paper leads us to cautiously support the usage of ICD-9-defined smoking status for use in genetic association studies. Given the decreased sensitivity of this method compared to NLP, we expected there would be a decrease in power to detect SNP effects. However, we observed a less than 1% average reduction in power using the ICD-9 classification method. While NLP-based methods or manual retrieval of smoking status will provide the most accurate results, unavailability of free-text records should not deter investigators from using structured data to determine smoking status.

Conclusion

This study demonstrates that ICD-9 tobacco use codes are very specific, if not particularly sensitive, for identifying smoking status in a general clinic population. Importantly, absence of these codes does not indicate a lack of smoking, even in a high smoking prevalence population (eg, lung cancer). Documentation procedures among different types of physicians and clinics appear to be similar. Temporal analysis reveals that transitions between current tobacco use and history of tobacco use codes correlate with behavior as recorded in patients’ medical records. Finally, use of these methods to determine smoking status for genetic studies does not appear to significantly alter regression outcomes or reduce power.
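
To make the distinction between specificity and sensitivity concrete, the sketch below computes both from a two-by-two table of code-based status against a reference standard. The counts are illustrative assumptions, chosen only so that the ratios match the values reported in the abstract (specificity of 1, sensitivity of 0.32); they are not the counts observed in this study, and the function names are hypothetical. The negative predictive value depends on the assumed prevalence and is shown only to illustrate why an absent code is weak evidence of non-smoking.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of true smokers who carry a tobacco use code."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of true non-smokers who carry no tobacco use code."""
    return true_neg / (true_neg + false_pos)


def negative_predictive_value(true_neg, false_neg):
    """Fraction of patients without a code who are truly non-smokers."""
    return true_neg / (true_neg + false_neg)


# Illustrative counts only (NOT study data): 100 smokers, 100 non-smokers.
tp, fn = 32, 68   # smokers with / without a tobacco use code
tn, fp = 100, 0   # non-smokers without / with a tobacco use code

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"negative predictive value = {negative_predictive_value(tn, fn):.2f}")
```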

Acknowledgments

We would like to thank Robert Goodloe for scripting support for the association analyses and Drs Joshua Denny and Trent Rosenbloom for their insightful feedback.

Footnotes

Funding: This work was supported in part by the National Institute of General Medical Sciences Training grant T32GM080178, the National Cancer Institute grant R01CA141307, and by the National Human Genome Research Institute grant U01HG004798 for the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study.

Competing interests: None.

Provenance and peer review: Commissioned; externally peer reviewed.

References

  • 1. The health consequences of smoking: a report of the surgeon general. Atlanta, GA: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health, 2004
  • 2. McGinnis KA, Brandt CA, Skanderson M, et al. Validating smoking data from the Veteran's Affairs Health Factors dataset, an electronic data source. Nicotine Tob Res 2011;13:1233–9
  • 3. Murphy EC, Ferris FL, O'Donnell WR. An electronic medical records system for clinical research and the EMR EDC interface. Invest Ophthalmol Vis Sci 2007;48:4383–9
  • 4. Roden DM, Pulley JM, Basford MA, et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther 2008;84:362–9
  • 5. Ritchie MD, Denny JC, Crawford DC, et al. Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 2010;86:560–72
  • 6. McCarty CA, Chisholm RL, Chute CG, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011;4:13
  • 7. Clark C, Good K, Jezierny L, et al. Identifying smokers with a medical extraction system. J Am Med Inform Assoc 2008;15:36–9
  • 8. Hazlehurst B, Sittig DF, Stevens VJ, et al. Natural language processing in the electronic medical record: assessing clinician adherence to tobacco treatment guidelines. Am J Prev Med 2005;29:434–9
  • 9. Blumenthal D, Tavenner M. The "meaningful use" regulation for electronic health records. N Engl J Med 2010;363:501–4
  • 10. Uzuner O, Goldstein I, Luo Y, et al. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc 2008;15:14–24
  • 11. Savova GK, Ogren PV, Duffy PH, et al. Mayo Clinic NLP system for patient smoking status identification. J Am Med Inform Assoc 2008;15:25–8
  • 12. Wicentowski R, Sydes MR. Using implicit information to identify smoking status in smoke-blind medical discharge summaries. J Am Med Inform Assoc 2008;15:29–31
  • 13. Zeng QT, Goryachev S, Weiss S, et al. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak 2006;6:30
  • 14. Carroll RJ, Eyler AE, Denny JC. Naïve electronic health record phenotype identification for rheumatoid arthritis. AMIA Annu Symp Proc 2011;2011:189–96
  • 15. Thompson WH, St-Hilaire S. Prevalence of chronic obstructive pulmonary disease and tobacco use in veterans at Boise Veterans Affairs Medical Center. Respir Care 2010;55:555–60
  • 16. DiFranza J, Ursprung WW. A systematic review of the International Classification of Diseases criteria for the diagnosis of tobacco dependence. Addict Behav 2010;35:805–10
  • 17. Sohn S, Savova GK. Mayo Clinic smoking status classification system: extensions and improvements. AMIA Annu Symp Proc 2009;2009:619–23
  • 18. Liu M, Shah A, Peterson NB, et al. A study of transportability of an existing smoking status detection module across institutions. AMIA Annu Symp Proc 2012;2012:577–86
  • 19. Khuder SA, Mutgi AB. Effect of smoking cessation on major histologic types of lung cancer. Chest 2001;120:1577–83
  • 20. Thorgeirsson TE, Gudbjartsson DF, Surakka I, et al. Sequence variants at CHRNB3-CHRNA6 and CYP2A6 affect smoking behavior. Nat Genet 2010;42:448–53
  • 21. Skol AD, Scott LJ, Abecasis GR, et al. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 2006;38:209–13
  • 22. Howlader N, Noone AM, Krapcho M, et al. SEER cancer statistics review, 1975–2009 (vintage 2009 populations). Bethesda, MD: National Cancer Institute. http://seer.cancer.gov/csr/1975_2009_pops09/, based on November 2011 SEER data submission, posted to the SEER website, April 2012 (accessed 10 July 2012)
  • 23. Lubin JH, Caporaso NE. Cigarette smoking and lung cancer: modeling total exposure and intensity. Cancer Epidemiol Biomarkers Prev 2006;15:517–23
  • 24. McKay JD, Hung RJ, Gaborieau V, et al. Lung cancer susceptibility locus at 5p15.33. Nat Genet 2008;40:1404–6
  • 25. Landi MT, Chatterjee N, Yu K, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet 2009;85:679–91
  • 26. Amos CI, Wu X, Broderick P, et al. Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat Genet 2008;40:616–22

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
