J Am Med Inform Assoc. 2019 May 30;26(11):1364–1369. doi: 10.1093/jamia/ocz068

Development and application of a high throughput natural language processing architecture to convert all clinical documents in a clinical data warehouse into standardized medical vocabularies

Majid Afshar 1,2, Dmitriy Dligach 1,2,3, Brihat Sharma 3, Xiaoyuan Cai 4, Jason Boyda 4, Steven Birch 4, Daniel Valdez 4, Suzan Zelisko 4, Cara Joyce 1,2, François Modave 1,2, Ron Price 1,4
PMCID: PMC7647210  PMID: 31145455

Abstract

Objective

Natural language processing (NLP) engines such as the clinical Text Analysis and Knowledge Extraction System (cTAKES) are a solution for processing notes for research, but optimizing their performance for a clinical data warehouse (CDW) remains a challenge. We aim to develop a high throughput NLP architecture using cTAKES and present a predictive model use case.

Materials and Methods

The CDW comprised 1 103 038 patients across 10 years. The architecture was constructed using the Hadoop data repository for source data and 3 large-scale symmetric multiprocessing (SMP) servers for NLP. Each named entity mention in a clinical document was mapped to a Unified Medical Language System (UMLS) concept unique identifier (CUI).

Results

The NLP architecture processed 83 867 802 clinical documents in 13.33 days and produced 37 721 886 606 CUIs across 8 standardized medical vocabularies. Performance of the architecture exceeded 500 000 documents per hour across 30 parallel instances of cTAKES, including 10 instances dedicated to documents greater than 20 000 bytes. In a use case example for predicting 30-day hospital readmission, a CUI-based model had discrimination similar to n-grams, with an area under the receiver operating characteristic curve of 0.75 (95% CI, 0.74–0.76).

Discussion and Conclusion

Our health system’s high throughput NLP architecture may serve as a benchmark for large-scale clinical research using a CUI-based approach.

Keywords: natural language processing, unstructured data, clinical text and knowledge extraction system, data architecture, unified medical language system

INTRODUCTION

Information in the clinical narrative of the electronic health record (EHR) is a rich source of data, but its unstructured format renders it complex and difficult to utilize. Further, an estimated 80% of all data in an EHR system resides in an unstructured format.1,2 These documents lack common structural frameworks, contain many grammatical and spelling errors and a high level of lexical variation, and are often semantically ambiguous. Methods in natural language processing (NLP) have proven effective in automatic semantic analyses of clinical documents1 and have had a positive effect in clinical research,3–5 but few data exist on prediction models, such as for hospital readmission, that leverage all the available EHR notes in a clinical data warehouse (CDW).

Transforming unstructured EHR data into research data for analyses requires substantial processing effort from NLP engines, and data preparation accounts for more than 60% of the workload in producing a functional database.6 Modern clinical NLP engines such as the clinical Text Analysis and Knowledge Extraction System (cTAKES)7 are available as a solution for processing notes, but little guidance exists on optimizing their performance for an entire health system. Much of the existing literature involves processing notes for specific phenotyping tasks.3,4,8–10 With an ever-increasing quantity of unstructured data in CDWs, a scaled and standardized approach may greatly improve research efficiency and throughput.

First, we aim to develop a high throughput NLP architecture using the cTAKES engine to concept-map more than 10 years of clinical documents from our CDW to the Unified Medical Language System (UMLS). Second, we aim to examine the application of our architecture in the context of a 30-day hospital readmission prediction task.

MATERIALS AND METHODS

Patient population and health care setting

Loyola University Medical Center (LUMC) is a licensed 559-bed hospital and tertiary academic center. LUMC has maintained Epic (Epic Systems Corporation, Verona, Wisconsin) as its EHR vendor since 2003, which includes a Microsoft SQL Server-based CDW (Clarity). A data repository for clinical documents from the EHR spanning patient encounters between January 1, 2007 and June 30, 2018 was built for processing in the NLP architecture. The final data corpus used for the NLP architecture comprises 1 103 038 unique patients and 83 867 802 clinical documents.

On-premise data center architecture

The hardware and software used for the NLP architecture are detailed in Supplementary Material 1. Open source software, including Apache Hive 1.1.0 (hive.apache.org), Python 2.7.5 (Python Software Foundation), Java SE 1.8.0 (JDK; www.oracle.com), and the Go programming language (Golang) 1.11.4 (golang.org; Google, 2007), was prioritized for data processing and analysis. In the computational cluster, a pair of servers perform gateway functions to allow the import and export of clinical research data. Supported data import and export methods include file-based services such as network file services, server message block, secure file transfer protocol, and secure copy protocol, as well as custom processes developed as Golang services.

The Hadoop data lake spans 725 terabytes in a Hadoop distributed file system (HDFS) that is utilized to store the derived data sets for the NLP architecture. The file system utilizes the Apache Hadoop MapReduce2 and YARN (https://hortonworks.com/apache/yarn) frameworks for distributed job processing. Other computational resources are connected to the HDFS through client software gateways provided by Cloudera (www.cloudera.com). For this study, 3 large-scale symmetric multiprocessing (SMP) servers were provided to the cluster for computationally intensive processing with cTAKES. The final NLP architecture was constructed using the Hadoop data repository for source data and the 3 SMP servers as NLP processors. A logical diagram of our center’s biomedical computing cluster is shown in Figure 1.

Figure 1. Natural language processing (NLP) architecture. Hardware elements for the NLP architecture. The process proceeds from left to right: extraction, transformation, and loading from the electronic health record; data storage in the Hadoop cluster; NLP in the symmetric multiprocessing servers; and extracted data made available to end users in the clinical research database.

NLP engine with the clinical Text Analysis and Knowledge Extraction System (cTAKES)

Processing of the target 84 million documents began with the creation of an encrypted source data repository platform in Hadoop. A Golang microservice was connected to our existing CDW to extract the clinical documents. The service aggregates and encrypts (AES 256-bit) source data for transfer to HDFS. Data blocks are partitioned in Hadoop on the basis of document type. Local Apache Hive data models were built to provide direct access to the accumulated documents, and Hive queries were developed to support data quality and characterization activities. Additionally, 4 independent processes were developed for creation of annotated output files, which are stored in XML Metadata Interchange (XMI) format (Figure 2).
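As an illustration of this extract-encrypt-load step, the following minimal sketch approximates it in Python (the production microservice was written in Go). The table name, data source name, HDFS host, paths, and key handling are hypothetical placeholders, not the authors' code; it uses AES-256-GCM from the cryptography package and a WebHDFS client for the transfer.

```python
# Minimal sketch of the extract-encrypt-load step (illustrative only).
# Table name, DSN, host, and paths are hypothetical placeholders; the
# production service described in the paper was written in Go.
import os

import pyodbc                                 # ODBC access to the SQL Server CDW
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from hdfs import InsecureClient               # WebHDFS client

key = AESGCM.generate_key(bit_length=256)     # AES 256-bit key; use a key store in practice
aes = AESGCM(key)
hdfs_client = InsecureClient("http://hadoop-gateway:9870", user="etl")  # placeholder host

conn = pyodbc.connect("DSN=clarity_cdw")      # placeholder DSN
cursor = conn.cursor()
cursor.execute("SELECT doc_id, doc_type, note_text FROM clinical_documents")

for doc_id, doc_type, note_text in cursor:
    nonce = os.urandom(12)                    # unique nonce per document
    blob = nonce + aes.encrypt(nonce, note_text.encode("utf-8"), None)
    # Partition data blocks in Hadoop on the basis of document type.
    hdfs_client.write(f"/cdw/notes/doc_type={doc_type}/{doc_id}.bin",
                      data=blob, overwrite=True)
```

Partitioning by document type at write time is what later allows Hive data models to expose the accumulated documents directly, with each document type as a partition column.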

Figure 2. Natural language processing (NLP) architecture workflow. The 6-step workflow proceeds from left to right, with details at each step of the procedures to be completed before proceeding to the next step. Document refers to any clinical note or report.

Abbreviations: ECD, extract code data; EHR, electronic health record; ETL, extract, transform, and load.

Linguistic processing of all clinical documents was performed in cTAKES version 4.0.0 (http://ctakes.apache.org).7 Each named entity mention in the documents was mapped to a UMLS concept unique identifier (CUI) from the Metathesaurus and Semantic Network and subsequently analyzed to determine its negation status using the cTAKES negation module. An additional 72 UMLS semantic types were included to extract social and behavioral determinants of health utilizing the latest dictionary lookup module from Apache cTAKES. The full list of UMLS semantic type unique identifiers is shown in Supplementary Material 2, and the source code is available in the Apache cTAKES Subversion repository with associated documentation. The following medical vocabularies were mapped from the UMLS Metathesaurus: Current Procedural Terminology (CPT), Healthcare Common Procedure Coding System (HCPCS), International Classification of Diseases, Ninth and Tenth Revisions, Clinical Modification (ICD-9-CM/ICD-10-CM), Logical Observation Identifiers Names and Codes (LOINC), Medical Subject Headings (MeSH), Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT), and a standardized nomenclature for clinical drugs (RxNorm).
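Downstream consumers can recover the CUIs, source vocabularies, and negation flags from the XMI output. The following is a minimal sketch using only the Python standard library; it assumes the default cTAKES 4.0 type system, in which UmlsConcept elements carry cui, codingScheme, and code attributes and entity mentions reference them through an ontologyConceptArr attribute, with a polarity of -1 marking negation. Exact element names may vary with pipeline configuration.

```python
# Minimal sketch: pull CUIs, source vocabularies, and negation status from a
# cTAKES XMI file. Attribute names follow the default cTAKES 4.0 type system
# but may vary with pipeline configuration.
import xml.etree.ElementTree as ET

XMI_ID = "{http://www.omg.org/XMI}id"          # standard XMI id attribute

def extract_cuis(xmi_path):
    root = ET.parse(xmi_path).getroot()
    # Index UMLS concepts by xmi:id so mentions can be joined to them.
    concepts = {}
    for elem in root.iter():
        if elem.tag.endswith("UmlsConcept"):
            concepts[elem.get(XMI_ID)] = (
                elem.get("cui"), elem.get("codingScheme"), elem.get("code"))
    results = []
    for elem in root.iter():
        concept_ids = elem.get("ontologyConceptArr")  # present on entity mentions
        if concept_ids:
            negated = elem.get("polarity") == "-1"    # cTAKES negation module output
            for cid in concept_ids.split():
                cui, vocab, code = concepts.get(cid, (None, None, None))
                if cui:
                    results.append({"cui": cui, "vocab": vocab,
                                    "code": code, "negated": negated})
    return results

# Example: rows = extract_cuis("note_12345.xmi")
```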

Use case: prediction of 30-day unplanned hospital readmission

Application of the architecture was tested in a use case example: prediction of hospital readmission across all inpatient encounters with available clinical documents (n = 215 708) between January 1, 2007 and September 30, 2017. The primary analysis was the binary classification of 30-day unplanned all-cause hospital readmission. All adult (ages 18 years and older) inpatient encounters in the LUMC patient population were examined. Each medical vocabulary was evaluated independently and in combination.

Cases were identified in accordance with Centers for Medicare & Medicaid Services (CMS) rules for index admission and unplanned 30-day readmission.11 To analyze data on diagnoses and procedures that met qualifying criteria for readmission, the Clinical Classifications Software from the Agency for Healthcare Research and Quality was used to crosswalk with diagnoses from billing claims data in the EHR.12

Natural language processing and machine learning for the use case

The analysis cohort was divided into approximately 60% (n = 126 326) training, 20% (n = 44 677) validation, and 20% (n = 44 705) testing. The rate of readmission was similar across the train, validation, and test data sets at 9.9%, 9.8%, and 10.4%, respectively. The test set was drawn from an out-of-sample period (2015–2017) to represent the most recent clinical practice behaviors.13,14 A term frequency–inverse document frequency (tf-idf) transformation was used to weight the CUIs into normalized values for the machine learning classifier. N-grams (unigrams and bigrams) were also examined for comparison. CUIs and n-grams were inputs to a regularized logistic regression classifier in the training data set. Classifier hyperparameters were tuned to the highest area under the receiver operating characteristic curve (AUC ROC) in the validation data set.15 A grid search was performed to tune the type of regularization (L1 vs L2) and the regularization coefficient C. The goal of the experiment was to directly compare the utility of CUIs to n-grams with a model optimized for best AUC ROC. Recall/sensitivity, specificity, negative predictive value (NPV), and precision/positive predictive value (PPV) were also examined with their 95% confidence intervals. Analysis was performed using Python 3.6.5 (Python Software Foundation) and RStudio 1.1.463 (RStudio Team, Boston, MA). The Institutional Review Board of Loyola University Chicago approved this study.
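This modeling recipe maps directly onto scikit-learn. The sketch below reproduces its general shape, tf-idf weighting of CUI strings followed by a grid search over regularization type and coefficient C selected on validation AUC ROC; the candidate C values and the variable names (train_texts, y_train, and so on) are illustrative assumptions, not the authors' exact settings.

```python
# Sketch of the use-case modeling recipe: tf-idf weighted CUIs fed into a
# regularized logistic regression, tuned on validation AUC ROC.
# The C grid and input variables are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Each encounter is one "document": a space-joined string of its CUIs,
# e.g. "C0020538 C0011849 C0027051".
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X_train = vectorizer.fit_transform(train_texts)   # train_texts: list of CUI strings
X_val = vectorizer.transform(val_texts)

best_auc, best_model = 0.0, None
for penalty in ("l1", "l2"):                      # grid over regularization type
    for C in (0.01, 0.1, 1.0, 10.0):              # and regularization strength
        clf = LogisticRegression(penalty=penalty, C=C,
                                 solver="liblinear", max_iter=1000)
        clf.fit(X_train, y_train)
        auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_auc, best_model = auc, clf
print(f"Best validation AUC ROC: {best_auc:.3f}")
```

The liblinear solver is specified because it supports both L1 and L2 penalties, matching the grid described above.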

RESULTS

Deployment and production of NLP architecture

Approximately 84 million clinical documents were processed and annotated into XMI files across all medical vocabularies in 13.33 days, including downtime to manually change queue processing parameters. ICD-10, as a single medical vocabulary (n = 5.26 million codes), was processed in 1.55 days. Average processing times for single documents are listed in Table 1. Less than 0.1% of documents (n = 26 815) were size outliers greater than 100 000 bytes in length and were excluded from the architecture due to their long individual processing times.

Table 1.

Average single report processing times

Number of Annotators Input Note Size (Bytes) Average Time to Process (Seconds)
1 0-9999 0.029
10 000-19 999 0.240
≥ 20 000 0.543
10 0-9999 0.092
10 000-19 999 1.635
≥ 20 000 9.089

The final data corpus comprised 69 186 378 clinical notes and 14 681 426 reports with 141 700 distinct CUIs. Total counts across the 8 standardized medical vocabularies were 37 721 886 606 CUIs (Table 2).

Table 2.

Features extracted from all clinical documents (n = 83 867 802)

Medical Vocabulary | Frequency (%)
LOINC | 17 024 130 269 (45.1%)
SNOMED CT | 11 233 587 817 (29.8%)
MeSH | 5 752 874 568 (15.3%)
RxNorm | 1 580 300 288 (4.2%)
ICD-9-CM | 818 290 694 (2.2%)
ICD-10-CM | 526 588 528 (1.4%)
CPT | 599 669 981 (1.5%)
HCPCS | 186 444 461 (0.5%)
Total CUIs | 37 721 886 606

Abbreviations: CPT, Current Procedural Terminology; HCPCS, Healthcare Common Procedure Coding System; ICD-CM, International Classification of Diseases, Clinical Modification; LOINC, Logical Observation Identifiers Names and Codes; MeSH, Medical Subject Headings; RxNorm, standard names given to clinical drugs and drug delivery devices; SNOMED CT, Systematized Nomenclature of Medicine – Clinical Terms.

In the analysis of document size, 84.74% (n = 74 458 512) of documents were less than 5000 bytes in length, 13.38% (n = 11 280 870) were between 5000 and 20 000 bytes, and 1.88% (n = 1 587 034) were greater than 20 000 bytes. With a random distribution of documents across all queues, we noted a substantial performance lag, with a total processing time greater than 30 days. The lag was attributable to large files (>20 000 bytes in length) that consumed a large amount of memory in cTAKES and frequently required more than 1 hour of processing time or stopped the queue indefinitely. Our experience showed that larger documents intermixed with smaller documents disrupted timing across the entire architecture by blocking the throughput of the smaller documents, which delayed batching into the queues and outputs into the Hadoop cluster. Because several stages (Figure 1) depend on a threshold number of documents before proceeding, having dedicated queues for larger documents allowed us to change and optimize the thresholds for a particular queue. In this approach, 10 queues (on an additional, separate server) were allocated to process documents greater than 20 000 bytes in length. By applying this nonrandom approach with dedicated queues for different file sizes, the processing speed improved to approximately 13 days, with queue performance exceeding 500 000 documents per hour.
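The size-based routing that resolved the lag can be expressed in a few lines. The sketch below is a minimal illustration assuming the split reported above (10 of the 30 cTAKES instances dedicated to documents over 20 000 bytes, and outliers over 100 000 bytes excluded); the hashing scheme and queue names are hypothetical.

```python
# Sketch of nonrandom, size-based queue routing. Queue counts and byte
# thresholds mirror the paper; the hashing scheme and names are assumptions.
import zlib
from typing import Optional

N_GENERAL_QUEUES = 20     # general-purpose cTAKES instances (30 total, minus 10 dedicated)
N_LARGE_QUEUES = 10       # instances dedicated to documents > 20 000 bytes
LARGE_THRESHOLD = 20_000  # bytes; the "queue killer" cutoff observed above
MAX_BYTES = 100_000       # size outliers above this were excluded entirely

def route(doc_id: str, size_bytes: int) -> Optional[str]:
    """Pick a queue for a document, or None for excluded outliers."""
    if size_bytes > MAX_BYTES:
        return None                               # excluded from the architecture
    bucket = zlib.crc32(doc_id.encode("utf-8"))   # deterministic spread across queues
    if size_bytes > LARGE_THRESHOLD:
        return f"large-{bucket % N_LARGE_QUEUES}"
    return f"general-{bucket % N_GENERAL_QUEUES}"
```

The key property is that a slow, memory-hungry large document can only delay other large documents, never the high-volume stream of small ones.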

Use case: predicting 30-day hospital readmission

In the prediction model, a data corpus of 13 775 972 clinical documents was examined with 101 710 CUI features. The model from the unprocessed, raw clinical documents comprised 1 271 660 unigram features.

The machine learning classifier that produced the highest AUC ROC in the training data set of 126 326 inpatient encounters was a logistic regression model with a least absolute error (L1) regularization penalty. The performance of each standardized vocabulary and of the n-gram model for predicting 30-day unplanned hospital readmission in the test data set is shown in Table 3. CUIs performed nearly identically to n-grams as measured by AUC ROC.
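For reference, the operating-point metrics reported in Table 3 can be derived from test-set predictions as sketched below, continuing the illustrative names from the methods sketch (best_model, X_test, y_test). The 0.5 probability cutoff is an assumption for illustration; the paper does not state the threshold used.

```python
# Sketch: deriving the operating-point metrics in Table 3 from test-set
# predictions. The 0.5 cutoff is an assumed value for illustration.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_prob = best_model.predict_proba(X_test)[:, 1]   # best_model from the methods sketch
y_pred = (y_prob >= 0.5).astype(int)              # assumed classification cutoff

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
metrics = {
    "ROC AUC": roc_auc_score(y_test, y_prob),
    "Precision/PPV": tp / (tp + fp),
    "Recall/Sensitivity": tp / (tp + fn),
    "Specificity": tn / (tn + fp),
    "NPV": tn / (tn + fn),
}
```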

Table 3.

Test data set for prediction of 30-day readmission (n = 44 705)

Feature | ROC AUC (95% CI) | F1 Score | Precision/PPV (95% CI) | Recall/Sensitivity (95% CI) | Specificity (95% CI) | NPV (95% CI)
SNOMED | 0.75 (0.74–0.75) | 0.67 | 0.20 (0.19–0.20) | 0.70 (0.68–0.71) | 0.67 (0.67–0.68) | 0.95 (0.95–0.95)
ICD | 0.72 (0.72–0.73) | 0.65 | 0.19 (0.18–0.19) | 0.69 (0.67–0.70) | 0.65 (0.64–0.65) | 0.95 (0.94–0.95)
LOINC | 0.74 (0.73–0.74) | 0.66 | 0.19 (0.19–0.20) | 0.68 (0.67–0.70) | 0.67 (0.66–0.67) | 0.95 (0.94–0.95)
RxNorm | 0.72 (0.71–0.73) | 0.66 | 0.19 (0.18–0.19) | 0.66 (0.65–0.68) | 0.66 (0.66–0.67) | 0.94 (0.94–0.95)
MeSH | 0.75 (0.74–0.75) | 0.68 | 0.20 (0.19–0.21) | 0.68 (0.67–0.69) | 0.68 (0.68–0.69) | 0.95 (0.95–0.95)
SNOMED + ICD + LOINC + RxNorm + MeSH | 0.75 (0.74–0.76) | 0.68 | 0.19 (0.18–0.20) | 0.70 (0.68–0.71) | 0.70 (0.67–0.68) | 0.95 (0.95–0.96)
n-gram (raw text) | 0.75 (0.74–0.76) | 0.74 | 0.23 (0.22–0.23) | 0.57 (0.57–0.60) | 0.77 (0.76–0.77) | 0.94 (0.94–0.94)

Test data set from the period between June 11, 2015 and September 30, 2017. Raw text was examined as both unigrams and bigrams; results shown are for unigrams, which performed better.

Abbreviations: CI, confidence interval; F1 score, micro F1; ICD, International Classification of Diseases; LOINC, Logical Observation Identifiers Names and Codes; MeSH, Medical Subject Headings; NPV, negative predictive value; PPV, positive predictive value; ROC AUC, area under the receiver operating characteristic curve; RxNorm, standard names given to clinical drugs and drug delivery devices; SNOMED CT, Systematized Nomenclature of Medicine – Clinical Terms.

DISCUSSION

Our high throughput NLP architecture converted our health system’s data corpus of over 84 million unstructured clinical notes into a completely deidentified data repository of nearly 40 billion structured and standardized data elements. This task was accomplished at a rate of over 500 000 documents per hour through our on-premise data center and required less than 2 weeks of processing time. The result for predicting 30-day hospital readmission demonstrates that CUIs perform similarly to n-grams. The processed data are a new addition to our clinical research database for researchers and administrators interested in data mining and analytics from any note or report.

NLP engines at large scale are computationally intensive, both in processing the large volume of data and in concept mapping, which makes their application in large cohorts challenging.16,17 Large-scale efforts as described in this study are a major undertaking in IT staff time and resources. cTAKES relies on Java technology that leverages multiple physical processors and large amounts of local physical memory, which makes scaling difficult.7,18 While a final production run may be acceptable to complete over a longer time period, we show that implementation of cTAKES is troublesome, with near-complete stops (“queue killers”) during processing of larger documents, especially those over 20 000 bytes. Our architecture of subqueues overcame this hurdle, successfully processing all document sizes within a time frame conducive to staff and IT resource allocation. Our input queue management process assigned each queue a defined range of input document lengths (so documents of similar size went to dedicated server queues), which led to greater efficiency and control. Alternate strategies to address this issue, such as processing documents in order of increasing length, may work as well. A fast and comprehensive architecture avoids repeated processing runs that can quickly accumulate into substantial extra time and effort, which is less sustainable at a busy health IT center.

Furthermore, dictionaries are continually being updated and added to cTAKES, so as research needs evolve and grow, multiple annotations of the same document may be required over time. Our inclusion of an additional 72 UMLS semantic types was intended to extract social and behavioral determinants of health and help fill the gap in the structured data domains of the EHR,19 but research requirements and dictionary updates may require more runs of our NLP architecture in the future. The additional computational time presents opportunity costs for other tasks by IT staff, so fast and comprehensive processing times are a priority.

Different approaches to software engineering and hardware infrastructure at other centers have been used for scaling NLP and improving computing performance, but they were limited to subsets of data warehouses involving no more than 2 million documents.20,21 Many studies of high throughput NLP have focused on specific phenotyping tasks and on methods for circumventing manual annotation in disease classification.22–25 Investigators at Mayo Clinic implemented a similar architecture and computing infrastructure for an NLP pipeline using cTAKES and achieved processing speeds comparable to ours, but they demonstrated their pipeline on only a subset of their CDW.20 Although those authors used the cTAKES engine for concept mapping, they did not process multiple medical vocabularies at once and provided no guidance on how to scale and manage varying document sizes. To the best of our knowledge, this is the first report describing processing of the entirety of a CDW.

Little work exists in machine learning for the prediction of hospital readmission using EHR text, and models using clinical documents have focused on target patient cohorts with inputs ranging between 5000 and 60 000 features.26,27 Our readmission prediction model draws on a larger data set of more than 200 000 inpatient encounters and more than 13 million clinical documents, including approximately 100 000 CUI features and 1.3 million unigrams. In a conventional logistic regression model, we aimed to compare the performance of n-grams versus CUIs. We show that a CUI-based model has discrimination similar to n-grams, by AUC ROC, for predicting 30-day unplanned hospital readmission. N-grams were worse in recall but better in specificity when compared with CUI-based models. Although the F1 score was better for n-grams, the model was not tuned to achieve the highest F1 score. Additional advantages of CUI-based models over n-gram models include a deidentified format and a higher degree of interpretability from standardized medical concepts that account for lexical variation and semantic ambiguity. This may be more appealing for end users and researchers interested in data linkage with other CDWs in studies that require interoperability between centers. Validation at other centers is needed to support our suggestion that CUI features mapped to standardized medical vocabularies are a viable option over other formats for large-scale clinical research.

FUNDING

This work was supported in part by LUC-HSD and the Center for Health Outcomes and Informatics Research and our NCATS Clinical and Translational Science Award Program (1UL1TR002389-01). Additional support was provided by National Institute on Alcohol Abuse and Alcoholism grant K23AA024503 (MA).

AUTHOR CONTRIBUTIONS

Dr. Afshar and Mr. Price are the guarantors of the manuscript. Concept and design: MA, DD, RP, CJ. Acquisition, analysis or interpretation of data: MA, DD, RP, CJ, FM, BS, XC, JB, SB, DV, SZ. Final approval of the article: MA, DD, RP, CJ, FM, BS, SB, DV, XC, JB, SZ. Administrative, technical, or logistic support: MA, DD, RP, CJ, FM, BS, XC, JB, SZ, SB, DV.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.


ACKNOWLEDGMENTS

We would like to thank our Health Sciences Division provost, Dr. Margaret Callahan, for her support and oversight.

Conflict of interest statement

None declared.

REFERENCES

1. Ford E, Carroll JA, Smith HE, et al. Extracting information from the text of electronic medical records to improve case detection: a systematic review. J Am Med Inform Assoc 2016; 23(5): 1007–15.
2. Meystre SM, Savova GK, Kipper-Schuler KC, et al. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform 2008: 128–44.
3. Jones BE, South BR, Shao Y, et al. Development and validation of a natural language processing tool to identify patients treated for pneumonia across VA emergency departments. Appl Clin Inform 2018; 9: 122–8.
4. Castro VM, Dligach D, Finan S, et al. Large-scale identification of patients with cerebral aneurysms using natural language processing. Neurology 2017; 88(2): 164–8.
5. Carrell DS, Cronkite D, Palmer RE, et al. Using natural language processing to identify problem usage of prescription opioids. Int J Med Inform 2015; 84(12): 1057–64.
6. Sun W, Cai Z, Li Y, et al. Data processing and text mining technologies on electronic medical records: a review. J Healthc Eng 2018; 2018: 1.
7. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010; 17(5): 507–13.
8. Lingeman JM, Wang P, Becker W, et al. Detecting opioid-related aberrant behavior using natural language processing. AMIA Annu Symp Proc 2018; 2017: 1179–85.
9. Yetisgen-Yildiz M, Bejan CA, Wurfel MM. Identification of patients with acute lung injury from free-text chest x-ray reports. In: Proceedings of the 2013 Workshop on Biomedical Natural Language Processing; Sofia, Bulgaria.
10. Xia Z, Secor E, Chibnik LB, et al. Modeling disease severity in multiple sclerosis using electronic health records. PLoS One 2013; 8(11): e78927.
11. Desai NR, Ross JS, Kwon JY, et al. Association between hospital penalty status under the hospital readmission reduction program and readmission rates for target and nontarget conditions. JAMA 2016; 316(24): 2647–56.
12. Cowen ME, Dusseau DJ, Toth BG, et al. Casemix adjustment of managed care claims data using the clinical classification for health policy research method. Med Care 1998; 36(7): 1108–13.
13. Corey KM, Kashyap S, Lorenzi E, et al. Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): a retrospective, single-site study. PLoS Med 2018; 15(11): e1002701.
14. Minne L, Eslami S, de Keizer N, et al. Effect of changes over time in the performance of a customized SAPS-II model on the quality of care assessment. Intensive Care Med 2012; 38(1): 40–6.
15. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011; 12: 2825–30.
16. Divita G, Carter M, Redd A, et al. Scaling-up NLP pipelines to process large corpora of clinical notes. Methods Inf Med 2015; 54(6): 548–52.
17. Prosperi M, Min JS, Bian J, et al. Big data hurdles in precision medicine and precision public health. BMC Med Inform Decis Mak 2018; 18(1): 139.
18. Gonzalez-Hernandez G, Sarker A, O'Connor K, et al. Capturing the patient's perspective: a review of advances in natural language processing of health-related text. Yearb Med Inform 2017; 26: 214–27.
19. Venzon A, Le TB, Kim K. Capturing social health data in electronic systems: a systematic review. Comput Inform Nurs 2019; 37(2): 90–8.
20. Kaggal VC, Elayavilli RK, Mehrabi S, et al. Toward a learning health-care system—knowledge delivery at the point of care empowered by big data and NLP. Biomed Inform Insights 2016; 8: 13–22.
21. Schlegel DR, Crowner C, Lehoullier F, et al. HTP-NLP: a new NLP system for high throughput phenotyping. Stud Health Technol Inform 2017; 235: 276–80.
22. Gronsbell J, Minnier J, Yu S, et al. Automated feature selection of predictors in electronic medical records data. Biometrics 2019; 75(1): 268–77.
23. Yu S, Ma Y, Gronsbell J, et al. Enabling phenotypic big data with PheNorm. J Am Med Inform Assoc 2018; 25(1): 54–60.
24. Yu S, Liao KP, Shaw SY, et al. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc 2015; 22(5): 993–1000.
25. Yu S, Chakrabortty A, Liao KP, et al. Surrogate-assisted feature extraction for high-throughput phenotyping. J Am Med Inform Assoc 2017; 24: e143–9.
26. Rumshisky A, Ghassemi M, Naumann T, et al. Predicting early psychiatric readmission with natural language processing of narrative discharge summaries. Transl Psychiatry 2016; 6(10): e921.
27. Agarwal A, Baechle C, Behara R, et al. A natural language processing framework for assessing hospital readmissions for patients with COPD. IEEE J Biomed Health Inform 2018; 22(2): 588–96.
