TO THE EDITOR:
In the companion to this article, Karimi et al1 described the use of natural language processing (NLP) to identify distant cancer recurrence, along with the timing and the specific site of distant recurrence, for patients with breast and hepatocellular carcinomas. We commend the authors for developing techniques that identify the specific site of recurrence using unstructured data derived from the electronic medical record at one health care system. However, we believe that the manuscript does not address important methodologic limitations and does not provide sufficient details regarding key results. Moreover, to fully grasp the potential impact of this study, it is important to place its findings in a broader context that considers previously developed recurrence detection methods and the practical constraints of electronic health record (EHR) data.
First, the authors describe their technique as being NLP-specific, contrasting their approach with algorithms that rely on structured standardized codes (eg, International Classification of Diseases [ICD] or pharmacy codes). However, the cohorts used in this study were subsets of patients with cancer specifically selected on the basis of their high likelihood of having recurrence. Moreover, a priori identification of these high-risk patients relied on the use of structured data codes (eg, inclusion into the breast cancer cohorts was dependent on specific procedure and pharmacy codes, such as having either undergone a prior surveillance mammogram followed by two computed tomography or magnetic resonance examinations or having received two systemic therapies). Thus, the approach is not NLP-only, but rather a hybrid technique that combines NLP with claims- or EHR-based codes. The metrics reported in the manuscript only assess the performance of the NLP component within the enriched high-risk cohort; the results may be biased or may not be representative of the algorithm's performance in a larger or unselected cancer cohort. We suggest that the correct denominator for assessing algorithm performance should be the full sample of 7,116 patients, and that the algorithm should include both the codes used to select high-risk patients and the NLP model to identify recurrence.
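To illustrate why the choice of denominator matters, the following Python sketch compares the sensitivity of an NLP step measured inside an enriched cohort with the sensitivity of the full hybrid pipeline measured against all patients. All counts and variable names are hypothetical and chosen only for illustration; none are taken from Karimi et al or from any other study. The sketch assumes that patients who do not meet the code-based selection criteria are never evaluated by the NLP model, so any recurrences among them are missed.

```python
import math


def sensitivity(true_positives: int, all_true_cases: int) -> float:
    """Fraction of true recurrences that were detected."""
    if all_true_cases <= 0:
        raise ValueError("all_true_cases must be positive")
    return true_positives / all_true_cases


# HYPOTHETICAL counts for illustration only; not data from any study.
recurrences_in_full_cohort = 200      # true recurrences among all patients
recurrences_in_enriched_cohort = 150  # of those, patients meeting the code-based criteria
recurrences_flagged_by_nlp = 135      # of those, recurrences detected by the NLP step

# Sensitivity of the code-based selection step alone.
selection_sensitivity = sensitivity(
    recurrences_in_enriched_cohort, recurrences_in_full_cohort
)

# Sensitivity of the NLP step, evaluated only inside the enriched cohort.
nlp_sensitivity = sensitivity(
    recurrences_flagged_by_nlp, recurrences_in_enriched_cohort
)

# Sensitivity of the hybrid pipeline against the full-cohort denominator.
# Recurrences in patients who fail the selection step count as missed.
hybrid_sensitivity = sensitivity(
    recurrences_flagged_by_nlp, recurrences_in_full_cohort
)

# Under the stated assumption, the hybrid sensitivity is the product
# of the sensitivities of the two steps.
assert math.isclose(hybrid_sensitivity, selection_sensitivity * nlp_sensitivity)

print(f"selection step: {selection_sensitivity:.3f}")
print(f"NLP step (enriched cohort only): {nlp_sensitivity:.3f}")
print(f"hybrid pipeline (full cohort): {hybrid_sensitivity:.3f}")
```

Under these assumptions, the hybrid sensitivity cannot exceed the sensitivity of the NLP step measured in the enriched cohort, which is why metrics reported only for the enriched cohort may overstate performance in an unselected population.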
Second, the investigators included stage IV breast cancer cases in their analysis. These patients have de novo distant metastatic disease and, in nearly all situations, will never become disease-free. Therefore, they are not eligible to experience recurrence. If the algorithm were to identify events in these patients, these events would be better classified as disease progression rather than cancer recurrence.
Third, the investigators compared performance of their hybrid model with that of a claims-based approach that relied only on a small group of ICD codes. Previous studies have demonstrated that this is an inferior approach that does not reflect current practice.2 Published studies have also described methods that use standardized codes to detect cancer recurrence and the timing of recurrence across multiple cancer sites, including breast, colorectal, and lung cancer.3-5 Validation of these recurrence detection or timing algorithms, which have used multiple data sets and large patient populations, has demonstrated that these algorithms offer a recurrence detection sensitivity of ≥ 70% and an area under the receiver operating characteristic curve of ≥ 0.900. Moreover, these published algorithms have proven to be valuable for addressing a variety of novel and timely cancer outcome research questions using public and open data sources.6-10
Fourth, this study found that 2.31% of recurrent patients had target ICD codes. This value is extremely low and seems to be inconsistent with previous research studies, suggesting either a major omission of key codes or that the data set used for this analysis is not representative of data sets derived from other health care system EHRs or Medicare claims.8
Fifth, the authors underscore the superiority of their findings regarding estimating recurrence timing, yet the manuscript provides little description of the recurrence timing model methods or results. If the authors are referring to specific results from their previous manuscript, additional clarification would be helpful.
Finally, it seems likely that NLP techniques using unstructured text data will be superior to recurrence detection or timing algorithms that rely on structured EHR or claims data. However, NLP-based techniques have important shortcomings when used in the real-world setting: (1) access to unstructured text data and computing resources may be limited and (2) NLP-based solutions may be less portable between data sources because of variability in the way that clinicians document unstructured text. For all these reasons, we believe that EHR- or claims-based recurrence detection algorithms will continue to be useful for researchers for the foreseeable future.
SUPPORT
Supported by the Division of Cancer Epidemiology and Genetics, National Cancer Institute (R01 CA172143 to M.J.H.).
AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians.
Michael J. Hassett
Research Funding: IBM
Hajime Uno
Consulting or Advisory Role: Roche
No other potential conflicts of interest were reported.
REFERENCES
1. Karimi YH, Blayney DW, Kurian AW, et al: Development and use of natural language processing for identification of distant cancer recurrence and sites of distant recurrence using unstructured electronic health record data. JCO Clin Cancer Inform 5:469-478, 2021
2. Hassett MJ, Ritzwoller DP, Taback N, et al: Validating billing/encounter codes as indicators of lung, colorectal, breast, and prostate cancer recurrence using 2 large contemporary cohorts. Med Care 52:e65-e73, 2014
3. Ritzwoller DP, Hassett MJ, Uno H, et al: Development, validation, and dissemination of a breast cancer recurrence detection and timing informatics algorithm. J Natl Cancer Inst 110:273-281, 2018
4. Hassett MJ, Uno H, Cronin AM, et al: Detecting lung and colorectal cancer recurrence using structured clinical/administrative data to enable outcomes research and population health management. Med Care 55:e88-e98, 2017
5. Uno H, Ritzwoller DP, Cronin AM, et al: Determining the time of cancer recurrence using claims or electronic medical record data. JCO Clin Cancer Inform 2:1-10, 2018
6. Ritzwoller DP, Fishman PA, Banegas MP, et al: Medical care costs for recurrent versus de novo stage IV cancer by age at diagnosis. Health Serv Res 53:5106-5128, 2018
7. Hassett MJ, Uno H, Cronin AM, et al: Comparing survival after recurrent vs de novo stage IV advanced breast, lung, and colorectal cancer. JNCI Cancer Spectr 2:pky024, 2018
8. Hassett MJ, Uno H, Cronin AM, et al: Survival after recurrence of stage I-III breast, colorectal, or lung cancer. Cancer Epidemiol 49:186-194, 2017
9. Carroll NM, Ritzwoller DP, Banegas MP, et al: Performance of cancer recurrence algorithms after coding scheme switch from International Classification of Diseases 9th revision to International Classification of Diseases 10th revision. JCO Clin Cancer Inform 3:1-9, 2019
10. Hassett MJ, Banegas M, Uno H, et al: Spending for advanced cancer diagnoses: Comparing recurrent versus de novo stage IV disease. J Oncol Pract 15:e616-e627, 2019
