PURPOSE
Outcomes for patients with metastatic breast cancer (MBC) are continually improving as more effective treatments become available. Granular data sets of this unique population are lacking, and the standard method for data collection relies largely on chart review. Therefore, using electronic health records (EHR) collected at a tertiary hospital system, we developed and evaluated a computational phenotype designed to identify all patients with MBC, and we compared the effectiveness of this algorithm against the gold standard, clinical chart review.
METHODS
A cohort of patients with breast cancer was identified according to International Classification of Diseases codes, the institutional tumor registry, and SNOMED codes. Chart review was performed to determine whether distant metastases had occurred. We developed a computational phenotype, based on SNOMED concept IDs, which was applied to the EHR to identify patients with MBC. Contingency tables were used to aggregate and compare results.
RESULTS
A total of 1,741 patients with breast cancer were identified using data from International Classification of Diseases codes, the tumor registry, and/or SNOMED concept identifiers. Chart review of all patients classified each patient as having MBC (n = 416; 23.9%) versus not (n = 1,325; 76.1%). The final computational phenotype successfully classified 1,646 patients (95% accuracy; 82% sensitivity; 99% specificity).
CONCLUSION
Hospital systems with robust EHRs and reliable mapping to SNOMED have the ability to use standard codes to derive computational phenotypes. These algorithms perform reasonably well and have the added ability to be run at disparate health care facilities. Better tooling to navigate the polyhierarchical structure of SNOMED ontology could yield better-performing computational phenotypes.
INTRODUCTION
Breast cancer is the most common noncutaneous malignancy in women in the United States, and although the majority of women will be cured of their disease, up to 20%-30% will ultimately develop metastatic breast cancer (MBC), a noncurable and deadly disease.1,2 In addition, approximately 5%-10% of women will already have metastatic disease at the time of their initial diagnosis.3,4 Among these women, survival outcomes have been steadily improving over recent years as more effective treatments have become available,3 which has led to some interest in stratifying patients with MBC into subgroups with distinct outcomes.5 However, investigations of this unique population are limited because of the lack of large granular data sets. For example, the Commission on Cancer–accredited tumor registries report to the National Cancer Database and follow North American Association of Central Cancer Registries reporting guidelines defined in the Standards for Oncology Registry Entry (STORE) manual. However, tumor registries contain information limited by regulatory requirements. By contrast, single-institution databases (eg, medical specialty database) may contain more granular data on treatments but lack the size and diversity of the larger national databases. Furthermore, the creation of many single-institution databases, or even most tumor registries, relies largely on clinical chart review, a time-consuming process.
CONTEXT
Key Objective
Metastatic breast cancer is a noncurable disease affecting 20%-30% of women diagnosed with breast cancer. There is interest in stratifying patients with metastatic disease as survival outcomes improve and more effective treatments become available. Identifying a cohort of patients with this disease can be complex within electronic health records. This manuscript describes a method capable of easily identifying these patients and also provides code for easy implementation.
Knowledge Generated
At medical facilities that use Epic and Intelligent Medical Objects, the ability to use SNOMED CT codes as computational phenotypes is available. Structured Query Language can be used to codify a particular phenotype. This code enables sharing logic for phenotype implementation among nonaffiliated institutions.
Relevance
Accurately identifying a cohort of individuals with a particular phenotype is of interest in retrospective and prospective studies. With the ability to identify specific individuals with metastatic breast cancer, research teams will have the means to investigate this unique population.
Currently, Epic is the largest electronic health record (EHR) provider in the United States by market share.6 If a method for identifying these patients could be developed within Epic's EHR (henceforth called Epic), it could be applied at numerous institutions, creating a multi-institutional database with the breadth and depth necessary to answer meaningful questions related to this population. Clearly defining reproducible, valid, and reliable algorithms capable of identifying phenotypes within populations is a defined goal of future clinical registries and clinical trials.7 Others have presented the ideal framework that needs to be built to achieve these goals.8 One example of a platform built for sharing these algorithms is the Phenotype KnowledgeBase,9 an online repository enabling the sharing of phenotypic logic (ie, computational phenotype) metadata. The need to translate this metadata into code for implementation is one limitation of this system. In this paper, we propose an implementation-ready solution via Structured Query Language (SQL).
As early as 2005,10 Epic began collaborating with Intelligent Medical Objects (IMO) to receive a detailed map linking clinical interface terminology to International Classification of Diseases, Clinical Modification (ICD-CM), and SNOMED codes. Importantly, the diagnosis names providers select at a visit or add to problem lists are not the International Classification of Diseases (ICD) code names, but are IMO terms. All downstream coding (ICD codes and SNOMED concepts) comes from IMO. Figure 1 provides an overview of the relationship between one clinical concept and ICD-CM/SNOMED codes. IMO is not the only vendor providing this solution. Epic offers other options for mapping clinical concepts (eg, Health Language or the National Library of Medicine). In this study, we used the IMO and Epic collaboration to define two SNOMED computational phenotypes capable of identifying patients with MBC. These algorithms were evaluated at a tertiary hospital system and compared against the gold standard, clinical chart review.
FIG 1.
Epic- and IMO-enabled SNOMED computational phenotype workflow. A list of SNOMED browsers can be found on the National Institutes of Health website.15 ICD-10-CM, International Classification of Diseases, 10th Revision, Clinical Modification; IMO, Intelligent Medical Objects.
METHODS
Evaluation Population
Distant metastatic breast cancer (DMBC) was defined according to the Eighth Edition of the American Joint Committee on Cancer Cancer Staging Manual clinical interface terminology.11 To evaluate how well our algorithm identified patients as having DMBC or not, patients diagnosed with breast cancer (both metastatic and nonmetastatic) were identified. The cohort needed to have patients with and without metastatic disease to determine the effectiveness of the algorithm. Identifying a cohort of patients with breast cancer can be achieved using a variety of approaches. For example, at least three methods can be used to identify such a cohort: (1) queries of the EHR using ICD-CM codes,12 (2) the institutional tumor registry, or (3) SNOMED codes. Table 1 presents the search criteria used to derive each of these three populations.
TABLE 1.
Criteria for Identifying Patients With Breast Cancer
Groupings of ICD codes can be used to derive phenotypes and are often based on personal experience or taken from professional publications such as those maintained by Centers for Medicare & Medicaid Services.13 The main limitation in using ICD-only coding is the granularity of the codes. Curating a list of codes to identify a specific phenotype can be labor-intensive, requiring validation between a clinician and an analyst. Our team used clinical expertise to aggregate the codes in Table 1.
Within oncology, many researchers rely on the tumor registry to derive cohorts for research. This is a system outside of the EHR, which is mandated by the Federal government but maintained by individual institutions. Our team worked with cancer registrars to define a cohort of patients with breast cancer from the tumor registry (Table 1), using the variable primary site from the STORE manual14 with a value of C5%.
Using the workflow in Figure 1, we identified the SNOMED concept 254837009 (malignant neoplasm of breast) by searching for the term breast cancer within a SNOMED browser. Once this concept ID was found, we used the mappings within Epic to identify every patient associated with a diagnosis of this type.
Before pulling the data, we considered a subset of dates that would allow for meaningful comparison among all three approaches. The EHR and tumor registry were searched for patients between January 1, 2015, and July 1, 2018. This provided three and a half years of data to define a cohort. All three methods of identification relied on diagnosis information entered into the EHR. Searching for these codes only determines if a patient was associated with a diagnosis because of billing, encounter, problem list, hospital problem list, admitting diagnoses, or discharge diagnoses. Incorporating other criteria such as pathology notes or laboratory values could enhance the performance of a computational phenotype, but the intent was to evaluate how well a diagnosis-only approach would work.
Each source of data yielded more than 20,000 potential cases for our study (Fig 2). A total of 1,744 patients with breast cancer were selected for this study using a stratified sampling plan. Strata were based on origin of breast cancer diagnosis (ie, ICD codes, the tumor registry, and/or SNOMED concept identifiers). Because of migration of patient identifiers, three patients could not be mapped across these various data sources and had to be removed from the sample. Our final cohort consisted of 1,741 patients (Fig 2). Duke University Health System's Institutional Review Board reviewed and approved this project.
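As an illustration of how strata can be formed from the origin of the breast cancer diagnosis, the following is a minimal sketch. It assumes each source has already been reduced to a set of patient identifiers. The function names, the toy identifiers, and the fixed number drawn per stratum are assumptions made for this example; the actual sampling fractions used in the study are not described here.

```python
import random
from collections import defaultdict


def assign_strata(icd_ids, registry_ids, snomed_ids):
    """Group patient IDs by the combination of sources that flagged them."""
    strata = defaultdict(list)
    for pid in sorted(icd_ids | registry_ids | snomed_ids):
        # Key is (found by ICD, found by registry, found by SNOMED).
        key = (pid in icd_ids, pid in registry_ids, pid in snomed_ids)
        strata[key].append(pid)
    return dict(strata)


def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw up to n_per_stratum patients from each stratum (hypothetical plan)."""
    rng = random.Random(seed)
    return {
        key: sorted(rng.sample(ids, min(n_per_stratum, len(ids))))
        for key, ids in strata.items()
    }


# Toy patient identifiers, invented for illustration only.
icd = {1, 2, 3, 4, 5, 6}
registry = {4, 5, 6, 7, 8}
snomed = {2, 3, 6, 8, 9, 10}

strata = assign_strata(icd, registry, snomed)
sample = stratified_sample(strata, n_per_stratum=1)
for key, ids in sorted(sample.items()):
    print(key, ids)
```

With three sources there are at most seven non-empty strata, one for each combination in which at least one source flagged the patient.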
FIG 2.
Stratified sample scheme. ICD-CM, International Classification of Diseases, Clinical Modification.
Computational Phenotype Development—Tumor Registry
To define an algorithm capable of identifying DMBC from the tumor registry, our team worked with tumor registrars to identify STORE variables and values associated with this phenotype. Specifically, we used the variable “Type of First Recurrence.”14 Using the values listed in Table 2, any patient recorded with a value in that list was considered to have DMBC.
TABLE 2.
Distant Metastatic Computational Phenotype Definitions
Computational Phenotype Development—SNOMED
Two SNOMED computational phenotypes were built to identify DMBC using the logic in Figure 1. First, we chose a SNOMED browser from a list of resources on the National Library of Medicine's website.15 Using this resource, we searched for the pertinent SNOMED concept IDs via free text. For instance, we used the search term “metastasis from” and identified “metastasis from malignant tumor of breast” (315004001). Once the concept ID was found, we built a SQL query16 to identify all individuals with a diagnosis associated with the SNOMED concept of interest. We expected this definition to overidentify patients, also including patients with regionally metastatic disease (ie, only ipsilateral axillary lymph node metastases). In an attempt to remove regional metastases, a second computable phenotype (SQL query) was derived using all patients with a SNOMED concept ID of 315004001 and then removing all patients who also had a SNOMED concept of 94392001 (secondary malignant neoplasm of lymph nodes).
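The logic of the two phenotypes can be sketched in SQL as follows. This is not the query published in the Epic Report Repository, and it does not use the Epic Caboodle schema. The table `patient_snomed`, its columns, and the toy rows are hypothetical, and an in-memory SQLite database is used only so that the example is self-contained. The concept IDs are those given in the text.

```python
import sqlite3

INCLUDE_CONCEPT = 315004001  # metastasis from malignant tumor of breast
EXCLUDE_CONCEPT = 94392001   # secondary malignant neoplasm of lymph nodes

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per (patient, SNOMED concept) diagnosis link.
conn.execute(
    "CREATE TABLE patient_snomed (patient_id INTEGER, snomed_concept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO patient_snomed VALUES (?, ?)",
    [
        (1, 315004001),                   # metastasis concept only
        (2, 315004001), (2, 94392001),    # metastasis plus lymph node concept
        (3, 254837009),                   # breast cancer, no metastasis concept
        (4, 315004001), (4, 254837009),   # breast cancer plus metastasis concept
    ],
)

# Phenotype 1: any patient with the inclusion concept.
PHENOTYPE_1 = """
SELECT DISTINCT patient_id
FROM patient_snomed
WHERE snomed_concept_id = :include
ORDER BY patient_id
"""

# Phenotype 2: phenotype 1 minus patients who also carry the exclusion concept.
PHENOTYPE_2 = """
SELECT DISTINCT p.patient_id
FROM patient_snomed AS p
WHERE p.snomed_concept_id = :include
  AND NOT EXISTS (
      SELECT 1
      FROM patient_snomed AS x
      WHERE x.patient_id = p.patient_id
        AND x.snomed_concept_id = :exclude
  )
ORDER BY p.patient_id
"""

phenotype_1 = [row[0] for row in conn.execute(PHENOTYPE_1, {"include": INCLUDE_CONCEPT})]
phenotype_2 = [
    row[0]
    for row in conn.execute(
        PHENOTYPE_2, {"include": INCLUDE_CONCEPT, "exclude": EXCLUDE_CONCEPT}
    )
]
print(phenotype_1, phenotype_2)
```

In this toy data, patient 2 is captured by the first phenotype but dropped by the second because of the lymph node concept, which is the intended effect of the exclusion.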
RESULTS
For our cohort, 1,741 patient charts were reviewed for indication of DMBC. The first computational phenotype, which used only SNOMED code 315004001, identified patients with MBC, but at the expense of overidentifying 136 patients who did not truly have distant metastases according to chart review (Table 3). A review of five randomly selected charts from this group found that two patients did have DMBC (diagnoses found in notes not related to oncology); two patients did not have distant metastatic disease but likely had nodal metastases; and one patient had a biopsy related to an investigation of metastases, which was negative. This methodology had the highest naive sensitivity at 87.3%, but the lowest naive specificity at 89.7% (Table 4).
TABLE 3.
Comparison of Identification Rates of Patients With Metastatic Breast Cancer
TABLE 4.
Summary of Naive Performance Measures for Each Method of Identifying Patients With Metastatic Breast Cancer (on the basis of data from Table 3)
Our second computable phenotype, which used the SNOMED code for metastasis from malignant tumor of breast but excluded any patient who also had the SNOMED code for secondary malignant neoplasm of lymph nodes, identified slightly fewer patients with MBC. However, this coding drastically reduced the absolute number of false positives to 19 patients (down from 136). A review of five randomly selected charts from these false positives found that three patients did have MBC (two diagnoses found in notes not related to oncology; one mediastinal metastasis); one patient had metastases to the breast skin, which are not considered distant metastasis; and one patient had a nodal metastasis. This computational phenotype had the highest accuracy and naive specificity at 94.5% and 98.6%, respectively, but a lower naive sensitivity than the first phenotype (81.7% v 87.3%; Table 4).
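The naive performance measures for the second phenotype can be reproduced from counts reported in the text. The sketch below back-calculates the four cells of the contingency table from the chart review totals (416 with MBC, 1,325 without), the 19 false positives, and the 1,646 correctly classified patients. The cell values are derived here by subtraction and are not copied from Table 3, and the variable names are chosen for illustration.

```python
# Counts stated in the text.
n_mbc = 416            # chart review: distant metastatic disease
n_not_mbc = 1325       # chart review: no distant metastatic disease
false_pos = 19         # phenotype 2 flagged, chart review negative
correct = 1646         # patients classified correctly by phenotype 2

# Cells derived by subtraction (an assumption of this sketch).
true_neg = n_not_mbc - false_pos
true_pos = correct - true_neg
false_neg = n_mbc - true_pos

sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)
accuracy = (true_pos + true_neg) / (n_mbc + n_not_mbc)

print(f"TP={true_pos} FN={false_neg} FP={false_pos} TN={true_neg}")
print(f"sensitivity={sensitivity:.1%} specificity={specificity:.1%} accuracy={accuracy:.1%}")
```

These are naive estimates in the sense used in this paper: they take the sampled cohort at face value and do not adjust for the stratified sampling design.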
DISCUSSION
Granular data for patients with MBC are lacking, and as outcomes for these patients continue to improve, refined analyses will be required to provide further insight into the best treatment algorithms and expected prognoses for this unique population. To help achieve this goal, we sought to develop a computational phenotype that could identify patients with MBC.
Research teams often start with ICD code groupings to identify the phenotype of interest and use this as a benchmark for other approaches.17 These groupings may omit codes that would be helpful in identifying the phenotype of interest. For example, it is possible that a woman could initially be identified as a patient with breast cancer and subsequently missed for recurrence because of the research team not searching for a specific code like C79.31 (secondary malignant neoplasm of brain). Inconsistent use of secondary malignancy codes can lead to poorly performing computational phenotypes. When using ICD codes, every conceivable code needs to be accounted for individually, to avoid missing or inaccurate groups of codes.
The tumor registry is often used as a resource for defining cohorts in oncology research settings. However, when using the tumor registry, there are limitations that must be noted, depending on the data requirements. First, relying on ICD coding has the potential to miss certain patients, especially as codes evolve over time. Second, investigators who wish to derive a cohort of patients with a specific type of cancer will only identify patients who met SEER eligibility. If a patient was initially diagnosed and treated at an outside hospital, the tumor registry will only include them if the patient subsequently presents with evidence of active disease and receives treatment or surveillance at that institution.
In comparing the tumor registry results with those of the gold standard clinical chart review, we found that a primary cause of differences was that the tumor registry, by design, only includes first recurrence. Therefore, any subsequent recurrence found by alternate methods will not be included in the tumor registry. The tumor registry had a high naive specificity at 96.2%, but the lowest naive sensitivity at only 59.6% when compared against chart review (Table 4).
In our novel approach, we used IMO clinical concepts, the backbone of diagnosis coding in our health system. These terms map to both SNOMED and ICD-CM codes (Fig 1). Using SNOMED as a grouper for diagnoses associated with DMBC is faster than curating a list of ICD-CM codes. A broad query of the EHR yielded a stratified sample of 1,741 patients available to review and evaluate this methodology. Upon clinical chart review, 23.9% were found to have DMBC. Application of our computational phenotype to identify the subset of patients with MBC had an overall accuracy of 94.5%, naive specificity of 98.6%, and naive sensitivity of 81.7%, suggesting that it still misclassified a large number of patients as false negatives. This was most likely because of the removal of all patients with secondary malignant neoplasm of lymph nodes. In hindsight, using a more tailored SNOMED code (eg, only those occurring in the axillary lymph nodes) may have reduced the number of false negatives and increased the sensitivity.
Ultimately, we sought to build a computational phenotype that could be scaled to many organizations and potentially result in a national, multi-institutional registry of patients with MBC, and our preliminary findings suggest this is possible. Other authors have described the difficulty of implementing phenotypes across sites.7 In the past, there has been a need to harmonize discrepant EHRs to ensure proper integration. However, the adoption of Epic among many of the largest research organizations18 may obviate the need for the development and implementation of common data models for this purpose. If proven, this strategy could provide a more streamlined avenue to data sharing and analytics.
Beyond the advantages of our proposed pipeline enabling open text searches of phenotypes within the EHR, SNOMED enables additional functionality worth noting. Because SNOMED is considered a polyhierarchical structure, additional information can be derived once the appropriate SNOMED concept identifier is located.
As an example, consider the case where all patients with MBC were selected by identifying the SNOMED concept 315004001. Because this concept sits within the polyhierarchical structure of SNOMED, we are also able to create analytic variables that will be important for other analyses. One additional analytic variable that could be derived is the body site from which this tumor originated (eg, neoplasm of trunk). In cases of rare diseases, the ability to roll up similar diseases into a category that has larger sample sizes is a key first step. SNOMED provides a way to accomplish this goal.
In oncology, another important concept is to derive not only the site, but also whether the tumor at a given site is primary or secondary. Given the SNOMED concept ID of 315004001, a research team would also be able to identify that the breast cancer was primary and other secondary tumors are also present at other sites. In fact, the workflow established in Figure 1 allows users to identify the body site where metastases are present (eg, bone). Identification of features such as these is a crucial step in any analytics pipeline that wishes to go beyond cohort identification. For this reason, it is worth noting that the information added in this step goes above and beyond what is available in ICD code groupings.
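To make the idea of rolling up through a polyhierarchy concrete, the sketch below walks is-a relationships upward from a starting concept and collects every ancestor. The graph is a toy: the edges and the labels prefixed with `toy:` are invented for illustration and are not taken from a SNOMED release. Only the starting concept name and the roll-up category neoplasm of trunk come from the text.

```python
def ancestors(concept, is_a):
    """Return every concept reachable from `concept` by following is-a edges."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


# Toy is-a graph (child -> parents). A concept may have more than one parent,
# which is what makes the structure polyhierarchical. Edges are hypothetical.
IS_A = {
    "metastasis from malignant tumor of breast": [
        "toy: malignant tumor of breast",
        "toy: metastatic disease",
    ],
    "toy: malignant tumor of breast": ["toy: neoplasm of breast"],
    "toy: neoplasm of breast": ["neoplasm of trunk"],
    "toy: metastatic disease": ["toy: neoplastic disease"],
    "neoplasm of trunk": ["toy: neoplastic disease"],
}

rollup = ancestors("metastasis from malignant tumor of breast", IS_A)
print(sorted(rollup))
```

Any ancestor in the returned set could serve as a broader analytic category, for example grouping patients under neoplasm of trunk when the more specific concept yields too few cases.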
Within Epic, a diagnosis grouper can be built using ICD codes, SNOMED concepts, or even other codes. A listing of the grouper components can then be shared and rebuilt at other institutions, or the grouper itself could be exported and imported across institutions using back-end tools and macros. However, whether that grouper identifies the same phenotype when loaded in Epic at another health system depends on additional factors. First, the grouper must be built using shared/standard codes. To ensure a grouper finds an analogous population, both institutions must be using the same version of SNOMED, as well as the same IMO mapping tables. Although IMO updates its dictionaries multiple times throughout the year, these updates must be applied by each health system at the local level. Competing priorities and limited resources make the frequency of IMO updates highly variable. Therefore, a shared Epic diagnosis grouper using the same SNOMED concepts could provide differing ICD results. Epic does not currently offer the option to bypass SNOMED and easily build groupers directly using IMO clinical interface terms.
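The effect of differing mapping versions can be illustrated with a small sketch. Two hypothetical sites load different versions of a term mapping, and the same SNOMED-based grouper resolves to different sets of ICD-10-CM codes. The interface terms, the mapping rows, and the function name are invented for this example and do not represent actual IMO content; the ICD-10-CM codes shown are real codes for secondary malignant neoplasms.

```python
# Each row: (interface term, SNOMED concept ID, ICD-10-CM code).
# All rows are invented for illustration.
MAPPING_SITE_A = [
    ("breast cancer metastatic to bone", 315004001, "C79.51"),
    ("breast cancer metastatic to brain", 315004001, "C79.31"),
]

# Site B has applied a newer (hypothetical) update that adds one term.
MAPPING_SITE_B = MAPPING_SITE_A + [
    ("breast cancer metastatic to liver", 315004001, "C78.7"),
]


def icd_codes_for_grouper(mapping, snomed_ids):
    """Resolve a SNOMED-based grouper to ICD codes under one mapping version."""
    return {icd for _term, sid, icd in mapping if sid in snomed_ids}


grouper = {315004001}
site_a_codes = icd_codes_for_grouper(MAPPING_SITE_A, grouper)
site_b_codes = icd_codes_for_grouper(MAPPING_SITE_B, grouper)

# Codes that only one of the two sites would return for the same grouper.
discrepant = site_a_codes ^ site_b_codes
print(sorted(site_a_codes), sorted(site_b_codes), sorted(discrepant))
```

Comparing the resolved code sets in this way is one simple check two institutions could run before assuming that a shared grouper identifies an analogous population.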
Our MBC computational phenotype is available as a SQL query within the Epic Report Repository.16 To run this query, an analyst would need access to the Epic Caboodle data warehouse. Our team used Azure Data Studio (v1.4.5) to derive and run this query. While performing this research, our version of Epic had SNOMED US Edition March 2020 and IMO's September 2020 diagnosis terminology loaded.
Our current solution requires institutions to properly map clinical terminology to SNOMED codes. Institutions that do not have access to this mapping cannot use this method. Because of the polyhierarchical nature of SNOMED, correctly identifying the concept ID(s) of interest can be difficult. In this manuscript, we discussed removing all diagnoses tagged with the SNOMED concept of 94392001 (secondary malignant neoplasm of lymph node) to more specifically define DMBC. However, someone more familiar with SNOMED or clinical concepts might have chosen to remove only SNOMED concept 94181007 (secondary malignant neoplasm of axillary lymph nodes). Familiarity with SNOMED codes is a rate-limiting step in deriving a well-performing computational phenotype.
Our study did not implement a rigorous statistical design that would have enabled us to effectively estimate the unbiased operating characteristics of our computational phenotypes. We also did not compare this algorithm against other existing computational phenotypes. In the future, designing a study accounting for verification bias and comparing against benchmark computational phenotypes would be informative.
Also, this study did not have an adjudicated review process to determine the true status of distant metastases. In a subsequent assessment of false positives, we found contradicting results. This contradicting evidence was usually the result of distant metastatic disease documented in notes unrelated to oncology (eg, mention of disease in an emergency department visit).
Future studies should incorporate an adjudicated review process to reduce the number of discrepant findings. Moving forward, our team plans to alter our query to include more precise exclusion and inclusion criteria to increase the performance of the query. We would like to work with other institutions using Epic and IMO to validate our computational phenotype for MBC. This would enable us to realize our end goal of establishing a national, multi-institutional registry of patients where more tailored guidelines could be developed. By sharing our algorithm within the Epic Report Repository, other institutions can easily implement this method and verify these results.
SUPPORT
Supported in part by Duke Cancer Institute through NIH Grant No. P30CA014236 (PI: Kastan) for the Biostatistics Core.
AUTHOR CONTRIBUTIONS
Conception and design: Benjamin Neely, Claire Howell, Terry Hyslop, Jennifer K. Plichta
Administrative support: Jennifer K. Plichta
Provision of study materials or patients: Jennifer K. Plichta
Collection and assembly of data: Benjamin Neely, Caitlin E. Marks, Steve Power, Terry Hyslop, Jennifer K. Plichta
Data analysis and interpretation: Benjamin Neely, Mohammad Shahsahebi, Steve Power, Andrew Kanter, Terry Hyslop, Jennifer K. Plichta
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors
AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians.
Steve Power
Stock and Other Ownership Interests: Merck
Andrew Kanter
Employment: Intelligent Medical Objects Inc (I)
Leadership: Intelligent Medical Objects Inc
Stock and Other Ownership Interests: Intelligent Medical Objects Inc
Travel, Accommodations, Expenses: Intelligent Medical Objects Inc
Terry Hyslop
Consulting or Advisory Role: AbbVie
Travel, Accommodations, Expenses: AbbVie
No other potential conflicts of interest were reported.
REFERENCES
- 1. American Cancer Society: Cancer Facts & Figures 2020. Atlanta, GA, American Cancer Society, 2020
- 2. Kennecke H, Yerushalmi R, Woods R, et al.: Metastatic behavior of breast cancer subtypes. J Clin Oncol 28:3271-3277, 2010
- 3. Taskindoust M, Thomas SM, Sammons SL, et al.: Survival outcomes among patients with metastatic breast cancer: Review of 47,000 patients. Ann Surg Oncol 28:7441-7449, 2021
- 4. Tao L, Chu L, Wang LI, et al.: Occurrence and outcome of de novo metastatic breast cancer by subtype in a large, diverse population. Cancer Causes Control 27:1127-1138, 2016
- 5. Plichta JK, Thomas SM, Sergesketter AR, et al.: A novel staging system for de novo metastatic breast cancer refines prognostic estimates. Ann Surg 275:784-792, 2022
- 6. Landi H: Epic, Meditech Gain U.S. Hospital Market Share as Other EHR Vendors Lose Ground. Fierce Healthcare, 2020. https://www.fiercehealthcare.com/tech/epic-meditech-gain-u-s-hospital-market-share-as-other-ehr-vendors-lose-ground
- 7. Richesson RL, Sun J, Pathak J, et al.: Clinical phenotyping in selected national networks: Demonstrating the need for high-throughput, portable, and computational methods. Artif Intell Med 71:57-61, 2016
- 8. Mo H, Thompson WK, Rasmussen LV, et al.: Desiderata for computable representations of electronic health records-driven phenotype algorithms. J Am Med Inform Assoc 22:1220-1230, 2015
- 9. Kirby JC, Speltz P, Rasmussen LV, et al.: PheKB: A catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc 23:1046-1052, 2016
- 10. Beaudoin J: Epic Deal Thrusts IMO Into Limelight. Healthcare IT News, 2005. https://www.healthcareitnews.com/news/epic-deal-thrusts-imo-limelight
- 11. Giuliano AE, Edge SB, Hortobagyi GN: Eighth edition of the AJCC cancer staging manual: breast cancer. Ann Surg Oncol 25:1783-1785, 2018
- 12. World Health Organization: ICD-10 Version: 2016. International Statistical Classification of Diseases and Related Health Problems 10th Revision. 2015. https://icd.who.int/browse10/2016/en
- 13. Healthcare Cost & Utilization Project: Clinical Classifications Software (CCS) for ICD-10-PCS (Beta Version), 2019. 2020. www.hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp
- 14. American College of Surgeons: Standards for Oncology Registry Entry. STORE. 2018. https://www.facs.org/media/0ksm02ka/store_manual_2018.pdf
- 15. National Library of Medicine: SNOMED CT Browsers, 2017. https://www.nlm.nih.gov/research/umls/Snomed/snomed_browsers.html
- 16. Neely B: Distant metastatic breast cancer computational phenotype. 2021. https://datahandbook.epic.com/Reports/Details/9000658
- 17. Spratt SE, Pereira K, Granger BB, et al.: Assessing electronic health record phenotypes against gold-standard diagnostic criteria for diabetes mellitus. J Am Med Inform Assoc 24:e121-e128, 2017
- 18. Koppel R, Lehmann CU: Implications of an emerging EHR monoculture for hospitals and healthcare systems. J Am Med Inform Assoc 22:465-471, 2015