Author manuscript; available in PMC: 2019 Jul 3.
Published in final edited form as: Proc (IEEE Int Conf Healthc Inform). 2018 Jul 26;2018:328–331. doi: 10.1109/ICHI.2018.00045

Lessons Learned in the Development of a Computable Phenotype for Response in Myeloproliferative Neoplasms

Evan Sholle 1, Spencer Krichevsky 2, Joseph Scandura 3, Claudia Sosner 4, Thomas R Campion Jr 5
PMCID: PMC6608705  NIHMSID: NIHMS1035454  PMID: 31276120

Abstract

Determining response status in patients with myeloproliferative neoplasms is a complex problem requiring the integration of both structured and unstructured data elements from disparate information systems. By applying multiple techniques, a collaborative team of informatics professionals and research personnel was able to determine which elements were amenable to automated extraction and which required expert adjudication. With this knowledge, we built a system that joins programmatically derived and manually abstracted data elements to facilitate response assessment, an important end point in clinical and translational research in this disease area.

Keywords: clinical research informatics, secondary use, cancer, natural language processing, data mining, computable phenotype

I. INTRODUCTION

Myeloproliferative neoplasms (MPNs) constitute a group of malignancies of the hematopoietic stem cells that share certain clinical features. MPNs include essential thrombocytosis (ET), polycythemia vera (PV), myelofibrosis (MF), chronic myeloid leukemia (CML), and others. To assess the effectiveness of treatments for these conditions, the International Working Group – Myeloproliferative Neoplasms Research and Treatment (IWG-MRT) and the European LeukemiaNet (ELN) have issued a series of consensus reports detailing criteria for determining response in various MPNs, including MF [1]. Response is determined according to multiple factors, including blast count in bone marrow biopsy, cellularity, fibrosis, splenomegaly, transfusion dependence, platelet count, and others. Response may vary along the course of a patient’s treatment as the patient is induced on various lines of chemotherapy, responds, enters remission, relapses, and progresses. Nevertheless, it remains the standard metric for assessing the efficacy of differing modes of treatment for myeloproliferative neoplasms, as well as other liquid tumors.

The current gold standard for determining response in clinical trials requires manual review of the patient’s electronic health record (EHR) by trained personnel. As seen in Table I, while some of the data required for the assessment of response exist in a structured fashion, some exist in free text (e.g. biopsy reports). As such, research personnel must review the patient’s chart manually, entering the required data elements into a research data capture tool to facilitate the assessment of response status – a time-consuming and difficult process.

TABLE I.

ELEMENTS REQUIRED TO ASSESS RESPONSE

Data element | Structure
Blast count | Semi-structured – regularly expressed free text in bone marrow biopsy report
Laboratory values | Structured – tabular format in EHR
Cellularity | Unstructured – documented in free text in bone marrow biopsy report with high degree of lexical variation
Fibrosis | Unstructured – documented in free text in bone marrow biopsy report with high degree of lexical variation
Splenomegaly | Unstructured – while ICD-9/10 codes exist, most often documented in free text in progress note
Cytogenetics | Semi-structured – uses International System for Human Cytogenetic Nomenclature
Genomic data | Structured – HL7 feed from genomic information system (GIS)
Transfusion dependence | Structured – observations from inpatient electronic health record

Computable phenotyping, a process by which informaticians use structured definitions to mine data from the EHR in order to determine which patients meet specific clinical criteria, is an established approach towards translating difficult-to-define clinical concepts into concrete, computable categories.

We hypothesized that existing data extraction techniques, piloted as part of efforts to create a research data repository [2] for the Richard T. Silver Myeloproliferative Neoplasms Center at Weill Cornell Medicine (WCM), could alleviate the burden of manual chart review and data entry. These techniques include an instance of the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) [3] and custom natural language processing (NLP) pipelines built on the Leo platform [4]. By extracting as many of the required elements as possible, we could allow research personnel to focus their efforts on expert adjudication. Theoretically, by extracting all of the elements required by the IWG-MRT definition of response, the Center could develop a computable phenotype for response, operationalizing the group’s definition as an algorithm applied to structured data extracted from the EHR along with NLP-derived elements extracted from unstructured data.

II. METHODS

A. Setting

Weill Cornell Medicine (WCM) is an academic medical center on Manhattan’s Upper East Side. An academic staff of approximately 1000 physicians treats patients at more than 20 sites in New York City, utilizing the EpicCare Ambulatory EHR. The Richard T. Silver, M.D. Myeloproliferative Neoplasms Center (hereafter the Center) conducts patient care as well as clinical research designed to understand the cause, progression, and treatment of MPNs.

In conjunction with the Research Informatics (RI) division of WCM’s Information Technologies and Services department, the Center has embarked on the creation of a research data repository (RDR) that integrates EHR data from multiple systems to facilitate cohort discovery, data collection, and analysis [2]. The RDR contains patient data from both research and clinical systems, including data captured in REDCap and an instance of the OMOP CDM. To support research personnel’s efforts to assess patient response, RI and Center researchers have implemented an extraction strategy for each required element, chosen according to the form in which the data exist (as detailed in Table I).

B. Structured Data Extraction

Extraction of the structured data elements used to assess response is the most straightforward component of our approach. Using structured query language (SQL) queries, we were able to pivot structured laboratory results data from the Center’s instance of the OMOP CDM in such a fashion as to allow research personnel to view the data on a longitudinal basis, determining when individual patients had certain hematological values, including absolute neutrophil count, platelet count, and others (see Fig. 1).

Fig. 1.

Extraction of structured laboratory values from OMOP CDM.
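The pivot described above can be illustrated with a minimal sketch. It assumes a simplified `measurement` table carrying only four of the OMOP CDM columns, an in-memory SQLite database in place of the Center’s SQL Server environment, and placeholder concept identifiers (1001 and 1002) that are hypothetical and do not correspond to real OMOP concepts. Conditional aggregation is used here as a portable stand-in for whatever pivot technique the Center’s queries employ.

```python
import sqlite3

# Hypothetical concept IDs for illustration only; real OMOP concept IDs differ.
PLATELET_CONCEPT = 1001
ANC_CONCEPT = 1002

PIVOT_SQL = f"""
SELECT person_id,
       measurement_date,
       MAX(CASE WHEN measurement_concept_id = {PLATELET_CONCEPT}
                THEN value_as_number END) AS platelet_count,
       MAX(CASE WHEN measurement_concept_id = {ANC_CONCEPT}
                THEN value_as_number END) AS abs_neutrophil_count
FROM measurement
GROUP BY person_id, measurement_date
ORDER BY person_id, measurement_date
"""


def build_demo_measurements():
    """Create an in-memory table with a simplified subset of OMOP columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement ("
        "person_id INTEGER, measurement_concept_id INTEGER, "
        "measurement_date TEXT, value_as_number REAL)"
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?, ?)",
        [
            (1, PLATELET_CONCEPT, "2017-01-05", 450.0),
            (1, ANC_CONCEPT, "2017-01-05", 3.2),
            (1, PLATELET_CONCEPT, "2017-02-10", 390.0),
            (2, ANC_CONCEPT, "2017-01-20", 1.1),
        ],
    )
    return conn


def pivot_labs(conn):
    """Return one row per patient and date, with one column per lab value."""
    return conn.execute(PIVOT_SQL).fetchall()


lab_rows = pivot_labs(build_demo_measurements())
for row in lab_rows:
    print(row)
```

Each output row holds one patient-date pair, with `None` wherever a given laboratory value was not recorded on that date, which is the longitudinal view that reviewers would scan.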

Other structured data elements, including next-generation sequencing data and transfusions, were similarly amenable to structured extraction and transformation from their native source systems. However, extraction of many of the other elements required techniques with a significantly higher degree of methodological and technological sophistication. To extract blast counts from bone marrow biopsies, we initially employed a series of SQL queries based on regular expressions. Despite their apparently regular structure, blast counts enter our EHR as free text from an ancillary pathology application. After initial review, we determined that this approach lacked the sensitivity and specificity needed to support systematic assessment of response in MPN patients, in part due to lexical variation inherent in bone marrow biopsy reports.
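To make the limitation concrete, the following is a minimal sketch of a regular-expression extractor for blast counts, written in Python rather than SQL. The pattern and the sample phrasings are illustrative assumptions and are not the expressions used in the Center’s queries.

```python
import re

# Illustrative pattern: "Blasts: 3%", "Blast count = <1 %", and similar.
BLAST_RE = re.compile(
    r"\bblasts?\s*(?:count)?\s*[:=]?\s*(<\s*)?(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_blast_counts(text):
    """Return (is_upper_bound, percent) for each blast count found in text."""
    results = []
    for match in BLAST_RE.finditer(text):
        is_upper_bound = match.group(1) is not None  # a leading "<" was present
        results.append((is_upper_bound, float(match.group(2))))
    return results


print(extract_blast_counts("Blasts: 3%"))
print(extract_blast_counts("Blast count = <1 %"))
# A narrative phrasing that the pattern does not anticipate is silently missed.
print(extract_blast_counts("Blasts account for approximately 2% of cells"))
```

The third call returns an empty list even though the sentence states a blast percentage. Each such variant requires another hand-written pattern, which is one way in which lexical variation erodes sensitivity.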

C. Natural Language Processing

To further support the extraction of elements of interest from bone marrow biopsies, the RDR team and the Center research personnel engaged in an iterative definitional process to identify the elements from the bone marrow biopsy required to determine response, as detailed in Table II. These included blast count, cellularity, and fibrosis. However, each bone marrow biopsy report included multiple observations, from both the aspirate and the clot section, necessitating section detection and tagging to determine the source of each observation and label it accordingly. Furthermore, cellularity was recorded on both a quantitative (20%, 30%, etc.) and a qualitative (hypo-, normo-, hyper-) basis, whereas fibrosis was recorded on both a quantitative and a qualitative basis from both reticulin and Masson trichrome staining; all of these distinctions have clinical significance. Ultimately, we identified ten distinct target concepts requiring structured extraction.

TABLE II.

BONE MARROW BIOPSY NLP TARGETS

Concept | Data type
Biopsy blast count | Numeric
Biopsy cellularity – quantitative | Numeric
Biopsy cellularity – qualitative | Categorical
Biopsy fibrosis – grade | Categorical
Biopsy fibrosis – qualitative – reticulin | String
Biopsy fibrosis – qualitative – trichrome | String
Aspirate blast count – differential | Numeric
Aspirate cellularity – quantitative | Numeric
Aspirate cellularity – qualitative | Categorical
Aspirate blast count – flow cytometry | Numeric

After identifying the target concepts, the RDR team developed a natural language processing pipeline using Leo to extract the target concepts on a per-report basis and loaded the results into a SQL Server environment for validation by Center personnel.
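The idea of section detection followed by per-section concept tagging can be sketched with a few rules in Python. This is not the Leo pipeline: the section headers (`ASPIRATE:`, `BIOPSY:`), the patterns, and the sample report are all assumptions made for illustration, and only the two cellularity concepts are covered.

```python
import re

# Assumed section headers; real reports may label sections differently.
SECTION_RE = re.compile(r"^(ASPIRATE|BIOPSY)\s*:", re.IGNORECASE | re.MULTILINE)
CELLULARITY_PCT_RE = re.compile(r"cellularity[^.\n]*?(\d{1,3})\s*%", re.IGNORECASE)
CELLULARITY_QUAL_RE = re.compile(r"\b(hypo|normo|hyper)cellular\b", re.IGNORECASE)


def split_sections(report):
    """Map each detected section name to the text that follows its header."""
    matches = list(SECTION_RE.finditer(report))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[match.group(1).lower()] = report[match.end():end].strip()
    return sections


def extract_concepts(report):
    """Tag cellularity findings with the section in which they were found."""
    concepts = {}
    for name, body in split_sections(report).items():
        pct = CELLULARITY_PCT_RE.search(body)
        qual = CELLULARITY_QUAL_RE.search(body)
        if pct:
            concepts[f"{name} cellularity - quantitative"] = int(pct.group(1))
        if qual:
            concepts[f"{name} cellularity - qualitative"] = (
                qual.group(1).lower() + "cellular"
            )
    return concepts


REPORT = """ASPIRATE: Hypercellular spicules. Blasts 2%.
BIOPSY: Cellularity is approximately 80%. Hypercellular marrow."""

concepts = extract_concepts(REPORT)
print(concepts)
```

Keying each finding by section name keeps the aspirate and biopsy observations separate, mirroring the distinction between the aspirate and biopsy rows in Table II.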

D. Manual Data Capture

Other elements required for the assessment of response do not lend themselves as neatly to natural language processing techniques. Splenomegaly, for example, can be identified either from mentions within a progress note, which is complicated by the need for sophisticated negation detection, or from specific dimensions recorded in an imaging report. Likewise, karyotypes are theoretically governed by a standardized grammar, the International System for Human Cytogenetic Nomenclature (ISCN). However, human error and imperfect adherence to the notation make structured decomposition by regular expressions or other computational techniques difficult. While we hope to implement NLP techniques that can extract these concepts, the Center research personnel are expert in determining these values from a patient’s chart. Using REDCap, an established electronic data capture system, they record these values in structured form for individual patients.
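The difficulty with karyotypes can be illustrated with a deliberately narrow sketch. It assumes a single-clone karyotype of the form chromosome count, sex chromosomes, optional abnormalities, and a bracketed cell count, and it is not a method used by the Center. Real ISCN strings can also contain multiple clones separated by slashes and other constructs that this pattern does not attempt to handle.

```python
import re

# Narrow, illustrative pattern for a single-clone karyotype such as
# "46,XX,del(20)(q11.2)[18]"; this is not a complete ISCN grammar.
KARYOTYPE_RE = re.compile(
    r"^(?P<count>\d{2}),(?P<sex>[XY]+)"
    r"(?P<abn>(?:,[^,\[\]]+)*)"
    r"\[(?P<cells>\d+)\]$"
)


def parse_karyotype(raw):
    """Return a dict for a well-formed single-clone karyotype, else None."""
    match = KARYOTYPE_RE.match(raw.replace(" ", ""))
    if match is None:
        return None  # malformed or unsupported: route to manual review
    abnormalities = [a for a in match.group("abn").split(",") if a]
    return {
        "chromosome_count": int(match.group("count")),
        "sex_chromosomes": match.group("sex"),
        "abnormalities": abnormalities,
        "cells": int(match.group("cells")),
    }


print(parse_karyotype("46,XY[20]"))
print(parse_karyotype("46,XX,del(20)(q11.2)[18]"))
print(parse_karyotype("46 XY del20q"))  # deviates from the notation
```

The last call returns `None` because the entry omits the commas, parentheses, and cell count, so a strict parser has nothing reliable to decompose. Entries of this kind are the ones that would remain with expert reviewers.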

III. RESULTS

Utilizing a Microsoft SQL Server-based approach, we are able to integrate the various data elements in order to facilitate the structured assessment of response. Research personnel actively enter cytogenetics data and splenomegaly, as well as other pertinent data elements, into a REDCap project. The RDR team regularly extracts data from this REDCap project into a tabular format in the SQL Server environment as part of the SUPER data ingestion process [5]. Upon arrival in the SQL Server environment, the REDCap data are pivoted and loaded into a data mart using a stored procedure. The data mart is then subject to ad hoc SQL queries from both the RDR team and the Center personnel that join manually abstracted data to structured data from the OMOP CDM and genomic information systems, as well as to structured NLP output extracted from free text, as detailed in Fig. 2.

Fig. 2.

Simple pseudocode demonstrating join from manually abstracted REDCap data to automatically extracted data from EHR and GIS

Utilizing this technique, Center personnel can easily join data elements that require manual abstraction with elements that are amenable to structured extraction in an agile, modular fashion. The parameters of the SQL query can easily be adjusted to widen or narrow temporal windows and to integrate additional components as needed. Once the requisite data elements have been aggregated, Center personnel can conduct expert adjudication to determine response status at a given time and enter it into REDCap to facilitate clinical and translational research.
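A join of this kind, with an adjustable temporal window, can be sketched as follows. The table names (`redcap_assessment`, `lab_wide`), their columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the SQL Server data mart, so the date arithmetic uses SQLite’s `julianday` function rather than SQL Server syntax.

```python
import sqlite3

# Hypothetical tables: manually abstracted REDCap rows and pivoted lab rows.
JOIN_SQL = """
SELECT r.person_id, r.assessment_date, r.splenomegaly,
       l.measurement_date, l.platelet_count
FROM redcap_assessment AS r
LEFT JOIN lab_wide AS l
  ON l.person_id = r.person_id
 AND ABS(julianday(l.measurement_date) - julianday(r.assessment_date))
     <= :window_days
ORDER BY r.person_id, r.assessment_date, l.measurement_date
"""


def build_demo_mart():
    """Create a tiny in-memory stand-in for the data mart."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE redcap_assessment "
        "(person_id INTEGER, assessment_date TEXT, splenomegaly TEXT)"
    )
    conn.execute(
        "CREATE TABLE lab_wide "
        "(person_id INTEGER, measurement_date TEXT, platelet_count REAL)"
    )
    conn.execute("INSERT INTO redcap_assessment VALUES (1, '2017-03-01', 'present')")
    conn.executemany(
        "INSERT INTO lab_wide VALUES (?, ?, ?)",
        [(1, "2017-02-25", 410.0), (1, "2017-04-20", 350.0)],
    )
    return conn


def join_within_window(conn, window_days):
    """Join each manual assessment to labs within +/- window_days of it."""
    return conn.execute(JOIN_SQL, {"window_days": window_days}).fetchall()


mart = build_demo_mart()
print(join_within_window(mart, 7))   # only the lab drawn 4 days earlier
print(join_within_window(mart, 60))  # widening the window adds the later lab
```

Because the window is a query parameter, widening or narrowing it requires no change to the query text. The left join keeps each manually abstracted assessment in the result even when no laboratory value falls inside the window.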

IV. CONCLUSIONS

Despite the efforts we detail here, it is important to emphasize that we are not certain of the feasibility, or even the desirability, of a purely computable phenotype for response in MPNs. We recognize the intrinsic complexity of the data elements required to compute response, and the institutional barriers to implementing the level of structured data capture that would be required to fully enable the algorithmic determination of response. For the foreseeable future, expert determination will remain key in determining response. However, expert adjudication does not necessarily extend to data entry. Elements that are amenable to structured extraction should be extracted automatically so that research personnel can focus their efforts on tasks that leverage the unique human ability to resolve ambiguity in clinical narratives. Supplementing manually gathered REDCap data with data extracted from the EHR offers the potential of allowing research personnel to focus their efforts on these tasks, rather than on copying and pasting data from the EHR into an electronic data capture form.

We recognize multiple limitations to this work, including the inherent difficulty of natural language processing in this domain, the continuing requirement for extensive human effort, and the lack of formal validation of our NLP pipeline for extracting structured data from bone marrow biopsies. Future efforts will focus on the formal validation of existing natural language processing techniques, as well as the extension of the Leo pipeline to capture the data elements that still require expert adjudication, particularly with an eye towards a structured decomposition of ISCN karyotype annotation.

Working along these lines, we aim to develop a human-supervised system rather than a human-dependent system. We also hope to expand this technique beyond the domain of myeloproliferative neoplasms, as facilitating the detection of response status holds the potential to benefit the research enterprise in multiple domains. While this particular use case is tailored to MPNs, multiple liquid tumor disease areas share similar response features. The application of a computable phenotype supported by both structured data and NLP output could substantially reduce the effort required to determine an important clinical endpoint in this domain.

ACKNOWLEDGMENTS

We thank the Research Informatics team (Prakash Addekkanatthu, Marcos Davila, Steven Flores, Xiaobo Fuld, Joseph Kabariti, David Kraemer, Ryan McGregor, Sean Pompea, Julian Schwartz, and Jacob Weiser) for their contributions to the efforts detailed herein.

This study received support from NewYork-Presbyterian Hospital (NYPH) and Weill Cornell Medical College (WCMC), including the Clinical and Translational Science Center (CTSC) (UL1 TR000457) and Joint Clinical Trials Office (JCTO).

Contributor Information

Evan Sholle, Information Technologies & Services, Weill Cornell Medicine, New York, NY.

Spencer Krichevsky, Department of Medicine, Weill Cornell Medicine, New York, NY.

Joseph Scandura, Department of Medicine, Weill Cornell Medicine, New York, NY.

Claudia Sosner, Department of Medicine, Weill Cornell Medicine, New York, NY.

Thomas R. Campion, Jr., Department of Healthcare Policy and Research, Weill Cornell Medicine, New York, NY.

REFERENCES

[1] Tefferi A, Cervantes F, Mesa R, Passamonti F, Verstovsek S, Vannucchi AM, et al. Revised response criteria for myelofibrosis: International Working Group-Myeloproliferative Neoplasms Research and Treatment (IWG-MRT) and European LeukemiaNet (ELN) consensus report. Blood. 2013;122:1395–8.
[2] Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 2013;20(1):117–21.
[3] Sholle ET, Bollapragada R, Campion TR. Research data repositories: a tailored approach to secondary use of electronic health record data. AMIA Jt Summits Transl Sci Proc; 2016; San Francisco, CA.
[4] Observational Health Data Sciences and Informatics. Data Standardization [Internet]. Washington, DC: Observational Health Data Sciences and Informatics; [cited 2017 Sep 25]. Available from: https://www.ohdsi.org/data-standardization/.
[5] Sholle ET, Kabariti J, Johnson SB, Leonard JP, Pathak J, Varughese VI, Cole CC, Campion TR. Secondary use of patients’ electronic records (SUPER): an approach for meeting specific data needs of clinical and translational researchers. AMIA Annu Symp Proc; 2017; Washington, DC.
