AMIA Annual Symposium Proceedings. 2020 Mar 4;2019:504–513.

Machine Learned Mapping of Local EHR Flowsheet Data to Standard Information Models using Topic Model Filtering

Steven G Johnson 1, Lisiane Pruinelli 1,2, Bonnie L Westra 1,2
PMCID: PMC7153147  PMID: 32308844

Abstract

Electronic health record (EHR) data must be mapped to standard information models for interoperability and to support research across organizations. New information models are being developed and validated for data important to nursing, but a significant problem remains: how to correctly map the information models to an organization’s specific flowsheet data implementation. This paper describes an approach for automating the mapping process by using stacked machine learning models. A first model uses a topic model keyword filter to identify the most likely flowsheet rows that map to a concept. A second model is a support vector machine (SVM) that is trained to be a more accurate classifier for each concept. The stacked combination results in a classifier that maps flowsheet rows to information model concepts well, with an overall f2 score of 0.74. This approach is generalizable to mapping other data types that have short text descriptions.

Introduction

Healthcare data is now readily available in electronic form thanks to the continued adoption of electronic health records (EHR). Never before have we had easy access to so much data that has the promise to improve patient outcomes. But the secondary use of this data for analysis and sharing across organizations is stymied because we don’t have robust standard and formal information models to support analyzing and comparing the data1. An information model is a formal structure for representing the clinical information in the EHR and includes data elements, relationships between the elements, and rules that the data elements should satisfy2. While efforts such as the Observational Medical Outcomes Partnership (OMOP), Patient-Centered Outcomes Research Institute (PCORI), Fast Healthcare Interoperability Resources (FHIR) and Clinical Information Modeling Initiative (CIMI) have made good progress toward standardized information models3-6, those efforts have not given attention to data captured in the delivery of nursing and other inter-professional areas.

Much of this data is semi-structured and captured as “flowsheet” data7. Flowsheet data is used by many healthcare organizations to record custom and non-standardized information in the EHR. It is arranged as a spreadsheet-like data entry grid, with rows representing the different data types and columns representing the time periods at which observations of those rows are recorded. There is usually not an information model that the EHR builders use to model flowsheet data. Flowsheet data is particularly difficult to model because most of it is not coded to standard terminologies like Logical Observation Identifiers Names and Codes (LOINC) or Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT). Therefore, flowsheet data can’t be mapped using tools such as the Regenstrief LOINC Mapping Assistant (RELMA)8 or MetaMap9. Flowsheet rows (data types) typically have very little information that describes the data being captured.

Recent work has resulted in the development of information models in 10 key areas important to nursing10. Eight organizations formed an Information Model Validation Work Group (IMVWG) to evaluate and validate these models across their organizations. Two of the models are nearing completion for validation across organizations. One of the models, the Pain Information Model (Pain IM), has been validated and published11. However, a significant problem remains even after standard information models have been made available for use: each organization then has the unenviable task of mapping the standard information models to the data in its local EHR system. This can be a daunting, time-consuming and error-prone task, particularly since organizations may have multiple flowsheet rows for the same type of data element. Often these are created for different units, disciplines, or types of settings, and change over time with upgrades to the system.

In the EHR systems of the eight organizations in the IMVWG, flowsheet data types are described using two 90-character fields. The first field is the flowsheet data type’s internal name (called a flo_meas_name), and the second is the text that is displayed on the data entry screen (called a flo_disp_name). Each flowsheet data type also has a unique identifier (called a flo_meas_id), which can be different in each organization’s EHR; those identifiers are not guaranteed to be unique across different systems. For example, two systems can both use flo_meas_id=12345: the first may use it to represent “Blood Pressure” and the second to represent “Heart Rate”. Furthermore, the descriptions for the same concept (e.g., “Pain Rating 0-10 Scale”) may use the text “Pain Scale” at one health system and “Pain Rating” at another.
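To make the identifier and description issues concrete, a small hypothetical example is shown below; the field names follow the paper (flo_meas_id, flo_meas_name, flo_disp_name), but all identifier values and text are invented for illustration.

```python
# Hypothetical flowsheet row descriptors from two organizations.
org_a_rows = [
    {"flo_meas_id": 12345, "flo_meas_name": "R BP SYSTOLIC", "flo_disp_name": "Blood Pressure"},
    {"flo_meas_id": 20001, "flo_meas_name": "R PAIN SCALE 0-10", "flo_disp_name": "Pain Scale"},
]
org_b_rows = [
    # Same flo_meas_id as org A's blood pressure row, but a different meaning.
    {"flo_meas_id": 12345, "flo_meas_name": "R HEART RATE", "flo_disp_name": "Heart Rate"},
    # Same concept as org A's pain row, but different display text.
    {"flo_meas_id": 98765, "flo_meas_name": "R PAIN RATING", "flo_disp_name": "Pain Rating"},
]
```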

Once a standard information model has been developed (for example, the Pain IM), each of the organizations in the IMVWG assigned staff to map their flowsheet data types to the information model concepts. This was a time-consuming manual process that needed to be done after every model was developed. It required staff to search through flowsheet row descriptions and find the rows that they believed mapped to the concept in the information model. The problem was worse for new organizations that wanted to join the IMVWG. To get the benefit of the work group, they would have to manually map their flowsheet data to all of the information models, which required a large initial commitment of time and resources on their part. In order to make that initial mapping task less daunting, the IMVWG sought a way to automate the mapping process. The work group looked at natural language processing (NLP) and machine learning as possible approaches. Approaches using traditional NLP techniques were deemed unlikely to work because the flowsheet row descriptions are very short, are not sentences, and do not have traditional parts of speech. There has been work on mapping short descriptions as part of the caBIG project, but those tools were focused on rank ordering matches and supporting researchers in manually mapping between common data elements12.

The purpose of this study was to develop and implement a technique for mapping local EHR flowsheet data types that have short descriptions to standard information models using a machine learning approach. The technique will be illustrated using the Pain IM.

Methods

The overall approach has four steps:

  1. Develop a model using the IMVWG manually mapped flowsheet rows as training data for a machine learning algorithm

  2. Evaluate the model performance using the f2 score

  3. Validate the model predictions by having researchers review the results

  4. Update the incorrect mappings from the IMVWG and then re-train and re-evaluate the model

Machine learning is a powerful technique for developing models. The basic approach for mapping the Pain IM to local EHR data was to obtain examples of flowsheet rows correctly mapped to concepts in the Pain IM and use them as training data to build a machine learned model that can label new flowsheet rows with the right concept in the Pain IM. A portion of the Pain IM is shown in Figure 1.

Figure 1. Pain Information Model (partial)

One task of the IMVWG was for each organization to manually map their organization’s flowsheet rows to the Pain IM. As a result of that work, eight organizations mapped 1,837 flowsheet rows to the 103 concepts in the Pain IM. These previous manually developed mappings served as training data for the machine learning models.

Model Development

The method selected for development of a machine learned model was influenced by the flowsheet data structure and volume. In the EHR used by most of the workgroup participants, flowsheet rows are described using 60-120 characters of text. Because the EHR builders had to fit quite a bit of information into a small space, much of each description uses abbreviations and short phrases instead of full sentences to describe the flowsheet data types, so the words in the descriptions act like keywords.

The initial attempt to create a machine learned model for predicting labels used a term frequency-inverse document frequency (tf-idf) approach13. Each flowsheet row’s short description (the 60-120 characters of text defined in the EHR) is considered a document. Tf-idf creates a bag-of-words count of each of the terms that occur in these descriptions, but gives a higher weighting to rare terms and a lower weight to common terms. This results in word vectors that form the set of features covering all of the flowsheet row descriptions.

For the training, depending on which organization’s data was left out for testing, there were typically 100,000 negative examples (unmapped flowsheet rows) and approximately 500 positive examples (flowsheet rows that were mapped to concepts in the Pain IM), which is an extremely unbalanced dataset for machine learning. In addition to each individual word for the bag-of-words parsing, tf-idf was configured to also include all of the unique two- and three-word phrases (bigrams and trigrams). This led to models with 800,000 to 1,000,000 features. Training a support vector machine (SVM) using that many features on the unbalanced dataset would be very slow and would not result in good model performance.
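As a rough sketch of this feature extraction step (using scikit-learn's TfidfVectorizer here for brevity, whereas the authors used gensim for tf-idf; the example descriptions are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each flowsheet row description is treated as a short "document".
# The descriptions below are hypothetical examples.
descriptions = [
    "pain rating 0-10 scale",
    "pain scale numeric",
    "heart rate monitored",
    "blood pressure systolic",
]

# Unigrams, bigrams and trigrams, weighted so that rare, discriminative
# terms get higher tf-idf scores than common terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(descriptions)

print(X.shape)  # (number of rows, number of n-gram features)
print(vectorizer.get_feature_names_out()[:10])
```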

There are a number of approaches in machine learning that attempt to address the imbalance, including random under- or over-sampling14 and SMOTE15, which creates synthetic samples. These approaches have disadvantages in that they either add noise or remove potentially useful information. Some studies have found that model ensembles can intelligently identify sub-samples that improve model performance when using imbalanced data16. Therefore, a stacked model approach17 was pursued in which the first model filters the flowsheet rows to identify rows that should definitely not be categorized as a Pain concept, and the second model is an SVM that more accurately classifies a flowsheet row as a Pain concept or not. The first stage of the stacked model significantly reduced the number of negative examples using a topic modeling filter based on the TextRank algorithm18. A topic model was developed for each concept in the Pain IM. The top N topics were selected for that concept (enough to ensure that 100% of the positive training data was included in the topic filter). This topic filter was then applied to the training data, and the second model, an SVM, was trained on the remaining data.
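A simplified sketch of the first-stage filter is shown below. The paper builds the keyword list per concept with TextRank (via gensim); this stand-in just takes the most frequent tokens from the descriptions already mapped to a concept, but the filtering logic is the same: rows mentioning none of the keywords are ruled out before the SVM is trained or applied.

```python
from collections import Counter

def top_keywords(positive_descriptions, top_n=20):
    """Crude stand-in for TextRank keyword extraction: the most frequent tokens
    in the descriptions that the workgroup mapped to this concept."""
    counts = Counter()
    for desc in positive_descriptions:
        counts.update(desc.lower().split())
    return {word for word, _ in counts.most_common(top_n)}

def keyword_filter(rows, keyword_set):
    """First stage: keep only rows whose description mentions at least one keyword."""
    return [r for r in rows if any(k in r["description"].lower() for k in keyword_set)]

# Hypothetical usage for one Pain IM concept:
positives = ["Pain Rating 0-10 Scale", "Pain Scale Numeric", "Pain Rating"]
rows = [{"description": "Pain Rating 0-10 Scale"}, {"description": "Heart Rate"}]
candidates = keyword_filter(rows, top_keywords(positives))  # "Heart Rate" is filtered out
print(candidates)
```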

A separate topic model and SVM were built to classify each of the concepts in the Pain Information Model. The Pain IM has 103 concepts, so in the end there were 103 two-stage models run on the unmapped flowsheet data of a new organization to label and map their flowsheet rows. The SVM produces a score, which is the probability that a label is correct for a particular flowsheet row. Each model (topic filter and SVM) was applied to the flowsheet rows, and the label from the SVM that produced the highest score (probability) was used to label the flowsheet row. The process is shown in Figure 2. The software was written in Python and used the gensim19 library for tf-idf and TextRank and the scikit-learn20 library for the SVM models.
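A hedged sketch of the second stage and the final labeling step is shown below: one tf-idf + SVM classifier per concept (probability output enabled), with the highest-scoring concept above a threshold chosen as the row's label. The concept names, training descriptions, and threshold value are illustrative only, not the authors' actual code or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_concept_model(positive_desc, negative_desc):
    """Second stage: one tf-idf + SVM classifier for a single Pain IM concept."""
    X = list(positive_desc) + list(negative_desc)
    y = [1] * len(positive_desc) + [0] * len(negative_desc)
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)),
                          SVC(probability=True))  # probabilities are needed to rank labels
    return model.fit(X, y)

def label_row(description, concept_models, threshold=0.3):
    """Score a row against every concept's SVM and keep the best label above the
    threshold; returning None leaves the row unmapped."""
    best_concept, best_score = None, threshold
    for concept, model in concept_models.items():
        score = model.predict_proba([description])[0, 1]
        if score > best_score:
            best_concept, best_score = concept, score
    return best_concept

# Tiny hypothetical training set (the real training data had ~100,000 rows).
positives = ["pain rating 0-10 scale", "pain scale numeric", "pain rating",
             "pain score 0-10", "numeric pain rating scale"]
negatives = ["heart rate", "blood pressure systolic", "respiratory rate",
             "oxygen saturation", "temperature oral", "weight kg"]
models = {"Pain Rating 0-10 Scale": train_concept_model(positives, negatives)}
print(label_row("pain rating 0-10 scale", models))
```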

Figure 2. Process for mapping flowsheet rows using a stacked classifier for each Information Model concept

Model Evaluation

The performance of the model was evaluated using the f2 score. The f2 score is a combination of the recall and precision of the model, but it is biased toward recall. The formula for f2 is shown in Figure 3.

Figure 3. The f2 score defined in terms of precision and recall
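For reference, the f2 score is the standard F-beta measure with beta = 2:

```latex
f_2 = \frac{(1 + 2^2)\,\mathrm{precision}\cdot\mathrm{recall}}{2^2\cdot\mathrm{precision} + \mathrm{recall}}
    = \frac{5\,\mathrm{precision}\cdot\mathrm{recall}}{4\,\mathrm{precision} + \mathrm{recall}}
```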

In our use case, we don’t mind having more False Positives (when the model labels a flowsheet row even though it may not correctly map to a concept in the information model), so the f2 score is a good choice for model evaluation. The f2 score weights recall as twice as important as precision. Using a hyperparameter search, an SVM probability score threshold was found that maximized the f2 score across the entire set of SVMs.
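A minimal sketch of the threshold search, assuming the true labels and SVM probabilities are available as arrays; scikit-learn's fbeta_score with beta=2 gives the f2 value at each candidate cutoff (the data below are invented):

```python
import numpy as np
from sklearn.metrics import fbeta_score

def best_f2_threshold(y_true, y_prob, candidates=np.linspace(0.05, 0.95, 19)):
    """Return the probability cutoff that maximizes the f2 score."""
    scores = [fbeta_score(y_true, (y_prob >= t).astype(int), beta=2) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Hypothetical gold-standard labels and SVM probability scores.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.90, 0.40, 0.60, 0.30, 0.20, 0.55, 0.80, 0.10])
threshold, f2 = best_f2_threshold(y_true, y_prob)
print(threshold, f2)
```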

In order to simulate the mapping process that occurs when a new organization joins the IMVWG, the model was trained on seven organizations’ mappings, leaving one organization out to be used as the testing data for evaluating model performance. This was done eight times, leaving a different organization’s data out as the testing data each time.
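The evaluation loop can be sketched with scikit-learn's LeaveOneGroupOut, using the organization as the group; the features, labels, and classifier below are placeholders rather than the study's actual pipeline:

```python
import numpy as np
from sklearn.metrics import fbeta_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Placeholder data: feature vectors, binary concept labels, and an org id (1-8) per row.
rng = np.random.default_rng(0)
X = rng.random((80, 5))
y = rng.integers(0, 2, size=80)
orgs = np.repeat(np.arange(1, 9), 10)

f2_scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=orgs):
    model = SVC(probability=True).fit(X[train_idx], y[train_idx])
    y_pred = (model.predict_proba(X[test_idx])[:, 1] >= 0.5).astype(int)
    f2_scores.append(fbeta_score(y[test_idx], y_pred, beta=2))

print(f2_scores)  # one f2 score per held-out organization
```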

Model Validation

The manually developed mappings from the workgroup were used as the “gold standard” to train the model. These mappings were known to have mistakes, such as a single flowsheet row mapped to the wrong concept or to two different concepts in the IM, or a flowsheet row left unmapped even though a matching concept existed in the Pain IM. We therefore also carried out a mapping validation process in which the predictions of the model were manually reviewed by at least two researchers to determine whether the original “gold standard” mapping was correct or the prediction from the model was correct.

It would have been difficult to have the researchers manually review all 126,957 flowsheet row mappings from the eight IMVWG organizations. Instead, the researchers reviewed only the model predictions that differed from the IMVWG mappings. The differences fell into four Mismatch Categories:

  1. A flowsheet row wasn’t mapped by the IMVWG, but the model predicted a new label that the researchers verified was correct.

  2. A flowsheet row was mapped by the IMVWG (and verified correct by the researchers) but the model did not predict any label.

  3. A flowsheet row was mapped by the IMVWG, but the model predicted a new label that the researchers verified was correct and better than the IMVWG.

  4. A flowsheet row was mapped by the IMVWG, but the model predicted a label that the researchers verified was the wrong label.

Based on this review, the labels of the original IMVWG mappings were updated and the model training and testing process was performed again to determine the final model performance.

Results

The flowsheet row data from eight organizations was aggregated into a database containing information about 126,957 flowsheet rows in total. The model training process was run on 7 organizations’ data and the model was evaluated using the flowsheet data from the organization that was left out. The results of this process when using the original mappings are shown in Table 1.

Table 1.

Model performance scores (f2) for each organization (Original labels)

Org TP FP FN TN Rec Prec f2
1 50 119 36 12,504 0.30 0.58 0.33
2 12 142 17 9,586 0.08 0.41 0.09
3 72 90 17 13,040 0.44 0.81 0.49
4 154 156 44 21,172 0.50 0.78 0.54
5 245 202 227 33,439 0.55 0.52 0.54
6 31 168 19 14,219 0.16 0.62 0.18
7 61 103 127 7,049 0.37 0.32 0.36
8 133 198 25 13,500 0.40 0.84 0.45
All 758 1,178 512 124,509 0.39 0.60 0.42

The f2 score ranged from 0.09 to 0.54 with an overall f2 score of 0.42. There were 1,690 FPs and FNs where the model and the IMVWG mappings did not match. Three researchers (two were clinicians) examined those mappings to determine the correct labels for the flowsheet rows. The Cohen’s Kappa was 0.80, which indicates good agreement between the researchers for what the correct labels should be.

Examples of IMVWG flowsheet labels and predicted labels are shown in Table 2. The table includes examples that were correctly mapped (TP) as well as examples of mappings from each of the four Mismatch Categories.

Table 2.

Examples of Model Mappings and Mismatches


Using the updated labels, the models were retrained in the same manner, using seven organizations’ data for training and then testing on the organization that was left out. This was repeated so that each organization was used as the test data once. The resulting model statistics are shown in Table 3. The f2 scores improved and ranged from 0.59 to 0.86, with an overall f2 score of 0.74.

Table 3.

Model performance scores (f2) for each organization (updated labels)

Org TP FP FN TN Rec Prec f2
1 96 38 48 12,527 0.72 0.67 0.71
2 111 35 18 9,593 0.76 0.86 0.78
3 114 43 14 13,048 0.73 0.89 0.75
4 245 69 14 21,198 0.78 0.95 0.81
5 408 169 162 33,374 0.71 0.72 0.71
6 104 81 37 14,215 0.56 0.74 0.59
7 112 38 127 7,063 0.75 0.47 0.67
8 277 53 9 13,517 0.84 0.97 0.86
All 1,467 526 429 124,535 0.74 0.77 0.74

Discussion

The final model performance was quite good with an overall f2 score of 0.74. This approach will be very useful to new organizations that are added to the IMVWG. They will be able to get a high percentage of their flowsheet data types mapped automatically to the information models. The stacked models did a good job of finding the true positives (flowsheet rows that match the organization’s manual mappings) and true negatives (flowsheet rows that should not be labeled) and the false positives and false negatives were minimized. There are a number of findings from this research discussed below.

Manual mapping is error-prone. The first finding is that having human reviewers manually map flowsheet rows to concepts in an IM is error-prone. There are many examples of mappings that were missed by organizations. When the authors reviewed the results, we found 388 additional flowsheet rows that could be mapped to a concept. This is not surprising, since the clinicians in the organization who are performing the mapping can’t possibly be aware of all the places a concept appears in their flowsheet data. They were given a tool that can do sophisticated Boolean search expressions using keywords, but that still relies on the skill of the mapper to remember all the ways a concept is represented in their flowsheet data and which keywords to use for a search. A better approach is to have the computer suggest mappings so that the human reviewer only needs to decide whether each is a good mapping or not. The automated mapping techniques described in this study make that approach possible.

Automated mapping needs good training data. The models did not perform as well when there were insufficient flowsheet instances present in the training data. For example, only two of the organizations had a single flowsheet row named “Pain Level” mapped to the Pain Rating 0-10 Scale concept. When one of these organizations was used as the testing data, there was only one instance of “Pain Level” in the training data, which was not enough for the SVM to consistently classify it to the Pain Rating 0-10 Scale concept. On the other hand, many organizations used phrases like “Pain Rating” and “Pain Scale”, so the SVM consistently mapped those types of flowsheet rows to the “Pain Rating 0-10 Scale” concept.

Short description mapping will work in other domains. This approach is generalizable to other short description mapping problems. Within the EHR, there are many types of data that are described with keyword-like phrases. For example, orders are typically created as a custom list for each healthcare organization. There is currently no standardized list of orders, but even if one existed, the job of mapping each organization’s orders to the standard would be time-consuming and error-prone and would benefit from the approach described in this paper.

There are some limitations with this research. This approach was only tested using the Pain Information Model. The approach should work well on the other nine nursing IMs, but the work to carry out that validation still needs to be performed. A second limitation is that performance is dependent on the quality of the IM. Our goal was to map a flowsheet row to the best equivalent concept in the IM, which sometimes meant it was mapped to a higher-level concept because an exact equivalent concept did not exist in the IM. Also, all of the organizations in this study use the same EHR vendor. It would be helpful to apply this approach to flowsheet data types from other EHR vendors to ensure that it works with data from all vendors. Finally, the eight organizations involved in this research are geographically diverse, but tend to be medium to large institutions and are all from inside the United States (US). To fully validate this approach, flowsheet data should be obtained from smaller organizations and organizations outside the US.

Crowd-sourced mapping. Additional work is needed to develop tools that make it easier to review the mapped data and to specify which labels are correct and which are not. This would make it easy for multiple reviewers from an organization to quickly review how the model labeled their data and also review mappings from other organizations. In this way, the “gold standard” training data would continue to improve, and as new organizations join the workgroup, the predicted mappings from the automated process would also improve, thereby reducing the workload across the group.

Conclusion

This research shows that it is possible to use machine learning to automate the mapping of flowsheet rows to standard information models. Furthermore, the same approach should work wherever there is a need to map short description items to an information model. The model performance depends on having good training data from a variety of organizations. There is a need to develop tools and processes to support collaborative mapping of local EHR data to models so that the workload of doing the mapping can be distributed and the benefits of automating the mapping can be used by many organizations.

Acknowledgements

This work was supported in part by NIH NCATS grant UL1 TR002494. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


References

  1. Holve E, Segal C, Hamilton Lopez M. Opportunities and Challenges for Comparative Effectiveness Research (CER) With Electronic Clinical Data. Med Care. 2012;50(7):S11–8. doi: 10.1097/MLR.0b013e318258530f.
  2. Goossen W, Goossen-Baremans A, van der Zel M. Detailed clinical models: a review. Healthc Inform Res. 2010;16(4):201–14. doi: 10.4258/hir.2010.16.4.201.
  3. Observational Medical Outcomes Partnership (OMOP) [Internet]. [cited 2015 Jul 15]. Available from: http://omop.org/
  4. Health Level Seven (HL7). Clinical Information Modeling Initiative [Internet]. 2016 [cited 2016 Jan 2]. Available from: http://www.hl7.org/Special/Committees/cimi/index.cfm
  5. Health Level Seven (HL7). Welcome to FHIR [Internet]. 2015 [cited 2016 Feb 15]. Available from: https://www.hl7.org/fhir/
  6. Fleurence R, Curtis L, Califf R. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 2014;21:578–82. doi: 10.1136/amiajnl-2014-002747.
  7. Johnson SG, Byrne MD, Christie B, Delaney CW, Laflamme A, et al. Modeling Flowsheet Data for Clinical Research. In: AMIA 2015 Summit on Clinical Research Informatics Proceedings. San Francisco, CA: American Medical Informatics Association; 2015. p. 77–81.
  8. LOINC. RELMA [Internet]. [cited 2018 Oct 11]. Available from: https://loinc.org/relma/
  9. Aronson A. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium 2001. American Medical Informatics Association; 2001. p. 17–21.
  10. Westra BL, Christie B, Johnson SG, Pruinelli L, LaFlamme A, Sherman SG, et al. Modeling Flowsheet Data to Support Secondary Use. Comput Inform Nurs. 2017;35(9):452–8. doi: 10.1097/CIN.0000000000000350.
  11. Westra B, Johnson S, Ali S, Bavuso K, Cruz C, Collins S, et al. Validation and Refinement of a Pain Information Model from EHR Flowsheet Data. Appl Clin Inform. 2018;9(1):185–98. doi: 10.1055/s-0038-1636508.
  12. Kunz I, Lin MC, Frey L. Metadata mapping and reuse in caBIG. BMC Bioinformatics. 2009;10(Suppl 2):1–11. doi: 10.1186/1471-2105-10-S2-S4.
  13. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inf Process Manag. 1988;24(5):513–23.
  14. Chawla NV, Japkowicz N, Kotcz A. Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explor Newsl. 2004;6(1):1–6.
  15. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res. 2002;16:321–57.
  16. Yang P, Zhang Z, Zhou BB, Zomaya AY. Sample subset optimization for classifying imbalanced biological data. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Berlin: Springer; 2011. p. 333–44.
  17. Wolpert DH. Stacked generalization. Neural Networks. 1992;5(2):241–59.
  18. Mihalcea R, Tarau P. TextRank: Bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing; 2004.
  19. Rehurek R, Sojka P. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA; 2010. p. 45–50.
  20. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30.
