Author manuscript; available in PMC: 2023 Nov 4.
Published in final edited form as: Proc ACM Int Conf Inf Knowl Manag. 2022 Nov 4;2022:4828–4832. doi: 10.1145/3511808.3557157

MedCV: An Interactive Visualization System for Patient Cohort Identification from Medical Claim Data

Ashis Kumar Chanda 1, Tian Bai 2, Brian L Egleston 3, Slobodan Vucetic 4
PMCID: PMC9830554  NIHMSID: NIHMS1861513  PMID: 36636516

Abstract

Healthcare providers generate a medical claim after every patient visit. A medical claim consists of a list of medical codes describing the diagnosis and any treatment provided during the visit. Medical claims have been popular in medical research as a data source for retrospective cohort studies. This paper introduces a medical claim visualization system (MedCV) that supports cohort selection from medical claim data. MedCV was developed as part of a design study in collaboration with clinical researchers and statisticians. It helps a researcher to define inclusion rules for cohort selection by revealing relationships between medical codes and visualizing medical claims and patient timelines. Evaluation of our system through a user study indicates that MedCV enables domain experts to define high-quality inclusion rules in a time-efficient manner.

Keywords: Medical claims, electronic health records, embedding, deep learning, visual analytics, cohort identification, retrospective study

1. INTRODUCTION

Healthcare providers generate a medical claim after each patient visit as a list of medical codes describing the diagnosis and treatment. The two most common medical code vocabularies are the International Classification of Diseases (ICD) and Current Procedural Terminology (CPT). Given the wide range of human diseases and available treatments, the number of medical codes is very large (in the tens of thousands). Thus, it is challenging for humans to read and interpret medical claims.

Thanks to the ubiquity and standardized format of medical claim data, they have been popular in medical research as a data source for retrospective cohort studies of the association between factors such as demographics, diagnosis, and treatment and outcomes such as mortality and readmission [10, 12, 26, 29]. The first step in any study using medical claims is cohort identification [19, 27] or patient phenotyping [9, 25], which refer to identifying patients with a particular characteristic [3, 16]. Cohort identification requires specifying an inclusion criterion in a programmatic way that scans the medical claims and accurately selects the patients of interest. Writing an inclusion criterion by defining a set of codes corresponding to a particular diagnosis or treatment is daunting due to the vast number of medical codes.
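As a minimal illustration of what such a programmatic inclusion criterion can look like, the Python sketch below scans claims and keeps every patient who has at least one claim containing a code from an inclusion set. The data layout (pairs of patient id and code set), the function name, and the code labels are hypothetical and are not taken from MedCV.

```python
from typing import Iterable, Set, Tuple

# Assumed layout: one claim = (patient_id, set of codes on that claim).
Claim = Tuple[str, Set[str]]


def select_cohort(claims: Iterable[Claim], inclusion_codes: Set[str]) -> Set[str]:
    """Return ids of patients with at least one claim containing an inclusion code."""
    cohort = set()
    for patient_id, codes in claims:
        if codes & inclusion_codes:  # non-empty intersection
            cohort.add(patient_id)
    return cohort


# Toy claims with made-up code labels, for illustration only.
claims = [
    ("p1", {"code_a", "code_b"}),
    ("p2", {"code_c"}),
    ("p1", {"code_c"}),
    ("p3", {"code_b", "code_d"}),
]
cohort = select_cohort(claims, {"code_b"})
print(sorted(cohort))
```

The difficulty discussed above lies not in this scan but in choosing the contents of the inclusion set.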

The primary purpose of the proposed visualization software is to help medical researchers define inclusion criteria. In particular, we provide a visual interface that displays medical codes related to a query code. When a user enters a query code, the software shows other codes that either occur in similar types of claims or co-occur with the query code. To identify codes that occur in similar types of claims, we train a word2vec [15] model on a medical claims database and use the cosine similarity between the resulting code embeddings as the similarity measure. To identify codes that tend to co-occur with the query code, we use the pointwise mutual information (PMI) metric [24]. In addition to finding the related codes, the software enables users to drill down and visualize claims and patient timelines containing a particular code or set of codes.

Several data visualization tools have been proposed to help users gain insight into a large collection of electronic health records and medical claims, such as analyzing event sequences of multiple patients to understand associations between diagnoses, treatments, and outcomes [11, 17, 23, 28]. Composer is a representative visual tool [23] that helps orthopedic surgeons dynamically define treatment patterns from patient medical histories to analyze patient-reported outcomes. Rule-based queries were proposed to specify temporal constraints in cohort identification [11]. Medical event embedding was used for visual progression analysis by dividing the hospital visits into several stages [8]. However, we are unaware of prior research on visual analytics tools that support cohort identification from medical claims by exploiting similarity metrics between medical codes.

2. METHODOLOGY AND SYSTEM ARCHITECTURE

We use two metrics to define medical code similarity. The first relies on the word2vec model [15], which represents codes as vectors such that codes that occur in similar types of medical claims have similar representations [1, 2, 7]. The second uses PMI [24], which defines codes as similar if they tend to co-occur in claims. There is a mathematical relationship between the PMI and word2vec similarity measures [22].

2.1. Word2vec method

Word2vec [15] has been used in medical informatics to represent medical words and codes as low-dimensional vectors, or embeddings [1, 2, 5–7, 20]. In our application, the word2vec model (skip-gram variant) treats every claim as a document and every code in a claim as a word. Let us denote a claim with K codes as $C = \{c_1, c_2, \ldots, c_K\}$. For a given code $c_t$ in the claim, the model calculates the log-likelihood of observing the remaining codes in the claim as

$$\mathcal{L} = \sum_{c_i \in C \setminus \{c_t\}} \log p(c_i \mid c_t), \qquad (1)$$

where $p(c_i \mid c_t)$ is the conditional probability of observing code $c_i$ given $c_t$.

If $U$ is the matrix of the scanned code embeddings and $V$ is the matrix of the context code embeddings, where $U, V \in \mathbb{R}^{|V_{code}| \times f}$, $|V_{code}|$ is the total number of codes, and $f$ is the embedding dimension, then the conditional probability is defined using the softmax function as

$$p(c_i \mid c_t) = \frac{e^{U_{c_t} \cdot V_{c_i}}}{\sum_{j=1}^{|V_{code}|} e^{U_{c_t} \cdot V_{c_j}}}, \qquad (2)$$

where $U_{c_t}$ and $V_{c_i}$ are the vectors of codes $c_t$ and $c_i$, respectively. A stochastic gradient algorithm is used to maximize the objective function and learn vector representations of all the codes in the dataset. The cosine similarity between the learned vector representations serves as the similarity score between two codes. A high cosine similarity means that two codes occur in similar types of claims, indicating that they have a similar meaning.
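The two quantities used in this subsection can be sketched in a few lines of Python. The sketch below evaluates the softmax of Eq. (2) over a toy vocabulary and computes the cosine similarity between two code vectors. The embedding values, code labels, and function names are hypothetical and chosen only for illustration; a trained model would supply the actual matrices.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conditional_probability(U: Dict[str, Vector], V: Dict[str, Vector],
                            target: str, context: str) -> float:
    """Softmax of Eq. (2): probability of `context` given `target`."""
    scores = {code: dot(U[target], V[code]) for code in V}
    shift = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {code: math.exp(s - shift) for code, s in scores.items()}
    return exp_scores[context] / sum(exp_scores.values())


def cosine_similarity(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# Toy embeddings (made-up values); U holds scanned-code vectors, V context vectors.
U = {"d_1": [1.0, 0.0], "d_2": [0.9, 0.1], "h_3": [0.0, 1.0]}
V = {"d_1": [0.5, 0.1], "d_2": [0.4, 0.2], "h_3": [-0.3, 0.8]}

total = sum(conditional_probability(U, V, "d_1", c) for c in V)
print(round(total, 6))
print(cosine_similarity(U["d_1"], U["d_2"]) > cosine_similarity(U["d_1"], U["h_3"]))
```

In practice the full softmax is expensive for tens of thousands of codes, which is why word2vec implementations typically approximate it, for example with negative sampling.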

2.2. PMI method

PMI [24] measures if two codes tend to co-occur more often than by chance. The PMI between codes ci and cj is defined as

$$\mathrm{PMI}(c_i, c_j) = \log \frac{P(c_i \mid c_j)}{P(c_i)}, \qquad (3)$$

where $P(c_i \mid c_j)$ is the conditional probability of observing code $c_i$ given code $c_j$ in the same claim and $P(c_i)$ is the marginal probability of seeing $c_i$ in any claim. The probabilities can be calculated from the occurrence and co-occurrence counts of the codes. A high PMI value means that two codes co-occur more frequently than expected by chance.
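A minimal sketch of the count-based computation follows. With N claims, n_i claims containing code c_i, and n_ij claims containing both codes, the ratio in Eq. (3) equals (n_ij · N) / (n_i · n_j), so the measure is symmetric in the two codes. The natural logarithm, the function name, and the toy claims are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def pmi_table(claims: List[Iterable[str]]) -> Dict[FrozenSet[str], float]:
    """PMI for every pair of codes that co-occurs in at least one claim."""
    n_claims = len(claims)
    code_count: Counter = Counter()
    pair_count: Counter = Counter()
    for claim in claims:
        codes = set(claim)  # count each code once per claim
        code_count.update(codes)
        pair_count.update(frozenset(p) for p in combinations(sorted(codes), 2))

    pmi = {}
    for pair, n_ij in pair_count.items():
        ci, cj = tuple(pair)
        # P(ci | cj) / P(ci) = (n_ij / n_j) / (n_i / N)
        pmi[pair] = math.log((n_ij * n_claims) / (code_count[ci] * code_count[cj]))
    return pmi


# Toy claims with made-up code labels.
claims = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"c"}]
pmi = pmi_table(claims)
print(pmi[frozenset({"a", "b"})])  # positive: co-occur more often than chance
print(pmi[frozenset({"a", "c"})])  # negative: co-occur less often than chance
```

Pairs that never co-occur have an undefined (negatively infinite) PMI and are simply absent from the returned table.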

2.3. Visual system

To design a visual system that helps users analyze the relationships between medical codes and select appropriate codes to identify a cohort, we worked with physicians at Fox Chase Cancer Center for several months. We followed an iterative software development process [18] that required meeting with domain experts to get their feedback on problem definition and design task analysis. Figure 1 illustrates the architecture of our MedCV system and Figure 2 shows the interface. MedCV has six components: A) Query view, B) Table view, C) Projection view, D) Selected code view, E) Claim view, and F) Patient view.

Figure 1: The architecture of our proposed MedCV software.

Figure 2: A screenshot of MedCV with four views: (a) Query view: the user writes a query, such as a patient treatment name or a medical code; (b) Table view: shows a ranked list of related medical codes based on two different metrics; (c) Projection view: displays medical code relationships in a 2D view responsive to user interaction; (d) Selected code view: keeps a record of medical codes selected by the user.

Query view:

The Query view in Figure 2(a) allows the user to enter a query and set code search parameters. As a query, a user can type the name of a diagnosis or procedure or enter a CPT or ICD-9 medical code. A user can also select which types of codes they are interested in exploring. Based on the user’s query, MedCV provides a set of codes related to the query code in both the Projection view and the Table view.

Projection view:

This view visualizes relationships between codes. From the learned word2vec representations of medical codes, our system finds the nearest neighbors of the query code. It provides a 2-dimensional t-SNE [14] projection of code vectors that allows users to explore the code relationships. The Projection view is shown in Figure 2(c). Different colors are used to group codes based on the ICD-9 or CPT code hierarchy. When the mouse hovers over a circle, concise information with the code description, frequency, and PMI value is shown for the corresponding code. The interface allows users to filter codes by clicking on the legend icon. Underlining in the Projection view draws attention to neighboring codes with low PMI values (≤1) and high word2vec similarity. The underlined codes are interesting because word2vec neighbors with low PMI are codes that occur in the same types of claims but do not co-occur in the same claims.
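The underlining rule can be sketched as follows: take the k nearest word2vec neighbors of the query code and flag those whose PMI with the query code is at most 1. The value of k, the treatment of never co-occurring codes as low-PMI, the function name, and the toy scores are assumptions for illustration; only the PMI threshold of 1 comes from the description above.

```python
from typing import Dict, List, Tuple


def underlined_neighbors(similarity: Dict[str, float],
                         pmi: Dict[str, float],
                         k: int = 3,
                         pmi_threshold: float = 1.0) -> List[Tuple[str, bool]]:
    """Return the k most similar codes, each with a flag telling whether to underline it.

    `similarity` maps candidate code -> cosine similarity to the query code.
    `pmi` maps candidate code -> PMI with the query code; codes that never
    co-occur with the query are missing and treated as having very low PMI.
    """
    top = sorted(similarity, key=similarity.get, reverse=True)[:k]
    return [(code, pmi.get(code, float("-inf")) <= pmi_threshold) for code in top]


# Toy scores with made-up code labels.
similarity = {"x": 0.9, "y": 0.8, "z": 0.4, "w": 0.1}
pmi = {"x": 2.5, "y": 0.3, "z": 1.0}
flags = underlined_neighbors(similarity, pmi)
print(flags)
```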

Table view:

The Table view is an alternative way to inform users about the word2vec and PMI metrics, co-occurrence frequencies, and description of each neighboring code. There are two types of frequencies in the table: occurrence frequencies (PF and CF, the numbers of patients and claims in which a code occurs, respectively) and a co-occurrence frequency (BCF, the number of claims in which the query code and the listed code occur together). The table rows have the same colors as the corresponding code circles in the Projection view. The Table view allows sorting and filtering by code to further facilitate code selection.
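One way to compute the three frequencies in a single pass over the claims is sketched below. It assumes that PF counts distinct patients and CF counts claims; the data layout, function name, and code labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Claim = Tuple[str, Set[str]]  # assumed layout: (patient_id, codes on the claim)


def table_frequencies(claims: Iterable[Claim], query_code: str) -> Dict[str, Dict[str, int]]:
    """Compute PF, CF, and BCF for every code, relative to `query_code`."""
    patients_with = defaultdict(set)  # code -> patients having the code
    cf = defaultdict(int)             # code -> number of claims with the code
    bcf = defaultdict(int)            # code -> number of claims shared with the query code
    for patient_id, codes in claims:
        codes = set(codes)
        has_query = query_code in codes
        for code in codes:
            patients_with[code].add(patient_id)
            cf[code] += 1
            if has_query and code != query_code:
                bcf[code] += 1
    return {
        code: {"PF": len(patients_with[code]), "CF": cf[code], "BCF": bcf[code]}
        for code in cf
    }


# Toy claims with made-up code labels; "q" plays the role of the query code.
claims = [
    ("p1", {"q", "a"}),
    ("p1", {"a"}),
    ("p2", {"q", "a", "b"}),
    ("p3", {"b"}),
]
freq = table_frequencies(claims, "q")
print(freq["a"])
```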

The Table view has a “select” column that provides three options to the user: “add”, “do not add”, and “explore”. If the user thinks the code is related to the query code and wants to add it as an inclusion criterion, they select “add”. After the code is added, it is listed in the Selected code view table. If the user thinks the code is unrelated to the given query, they select “do not add”, which fades the row in the Table view. If “explore” is selected, a new window opens that allows the user to observe representative claims containing this code.

Claim view:

The Claim view shows a t-SNE [14] projection of fifty randomly selected claims that contain the code selected in the Table view, as shown in Figure 3. To project claims, we represent each claim by the vector of its primary code. The Claim view lets a user explore the distances between claims in the projected space and inspect the actual claim data with code descriptions.
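The data preparation behind this view can be sketched as follows: filter the claims containing the selected code, draw up to fifty of them at random, and look up the embedding of each claim's primary code. The claim fields ("claim_id", "primary_code", "codes"), the fixed random seed, and the toy embeddings are assumptions; the t-SNE step that would follow is omitted.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]


def sample_claims_for_view(claims: List[dict],
                           selected_code: str,
                           embeddings: Dict[str, Vector],
                           n: int = 50,
                           seed: int = 0) -> List[Tuple[int, Vector]]:
    """Pick up to n random claims containing `selected_code`; represent each by its primary-code vector."""
    matching = [c for c in claims if selected_code in c["codes"]]
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    sampled = rng.sample(matching, min(n, len(matching)))
    return [(c["claim_id"], embeddings[c["primary_code"]]) for c in sampled]


# Toy data: even-numbered claims contain the made-up procedure code "h_9".
claims = [
    {
        "claim_id": i,
        "primary_code": "d_1" if i % 2 == 0 else "d_2",
        "codes": {"d_1", "h_9"} if i % 2 == 0 else {"d_2"},
    }
    for i in range(120)
]
embeddings = {"d_1": [0.1, 0.2], "d_2": [0.3, 0.4]}
points = sample_claims_for_view(claims, "h_9", embeddings)
print(len(points))
```

Because a claim is represented only by its primary code, two claims with the same primary code map to the same vector regardless of their other codes.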

Figure 3: Claim view and Patient view. The Claim view shows 50 example claims for the code most recently selected by the user in the Table view; the Patient view shows the timeline for the claim selected in the Claim view. Here, a user selects claim #48 in the Claim view, and the Patient view shows the whole timeline of the corresponding patient and highlights claim #48 with a blue circle. ICD-9 diagnosis and CPT codes are marked with the prefixes “d_” and “h_”, respectively.

In addition, the user might be interested to know about a patient claim history surrounding a particular claim to understand the claim better. MedCV also shows a Patient view when the user clicks on a claim circle in the Claim view.

Patient view:

The Patient view shows a complete claim history of a selected patient, as seen in Figure 3. The x-axis shows time, and the y-axis shows a patient’s primary diagnosis. The selected claim is shown in blue in the figure. An expert user would focus on this point and look at the past and future claims to better understand the health condition of the patient and better reason about the selected claim.

Selected code view:

Whenever the user decides to add a code from the Table view as an inclusion criterion for the given query, the code is listed in this view. After completing the user study, this list would be the final result for a given query or patient treatment.

3. DEMONSTRATION

A video demonstration of MedCV is available online¹. Below, we present a brief overview of the MedCV workflow.

The user will interact with the above-mentioned views to discover inclusion criteria for a cohort. Figure 1 shows the workflow of different views. The user starts from the Query view and observes related codes in Projection and Table views. After analyzing code results, they can add codes as inclusion criteria in the Table view by selecting the “add” option from the “select” button. The selected code will appear in the Selected code view. However, if the user is interested in seeing some example claims for a code, they can go to the Claim view from the table by selecting the “explore” option. From the Claim view, the user can also observe the selected claim in the Patient view. After examining the Claim and Patient view, the user would return to the Table view to decide whether they are interested in adding the code or not. Finally, the Selected code view shows the final selected codes for cohort identification.

4. EVALUATION

To evaluate MedCV, we recorded subjective user experience [13] to understand if the users believe that it helps them to define inclusion rules for cohort identification tasks and if they are confident that their rules are good. We also measured how well the codes selected by users matched the gold standard.

4.1. Dataset

We used a publicly available synthetic dataset, DE-SynPUF², that is provided by the Centers for Medicare and Medicaid Services (CMS). This dataset contains three types of claims (inpatient, outpatient, and carrier claims) covering three years (2008–2010). The synthetic dataset is closely related to a real-life dataset, and previous research indicates that it retains essential properties of the actual patient population it has been derived from [4, 21]. We extracted 66 million claims from about 1 million synthetic patients. Each claim is a set of ICD-9 and CPT codes. There are a total of 14,838 unique ICD-9 diagnosis codes, 7,195 unique ICD-9 procedure codes, and 11,565 unique CPT codes.

4.2. Training process

We trained word2vec (skip-gram approach)³ and computed PMI⁴ on the DE-SynPUF dataset. We used the default training parameters (i.e., vector length = 100, learning rate = 0.01, epochs = 40, and number of negative samples = 5), other than the window size, which was set to a large number to include all codes in a claim as context.
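To illustrate the effect of the large window: when the window covers the whole claim, every other code in the claim becomes a context code for each target code, so a claim with K distinct codes yields K(K − 1) training pairs. The sketch below enumerates these pairs; the function name and the code labels are hypothetical, and the actual training implementation may generate pairs differently.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def skipgram_pairs(claim: Sequence[str]) -> List[Tuple[str, str]]:
    """All (target, context) pairs when the window spans the entire claim."""
    codes = list(dict.fromkeys(claim))  # drop repeated codes, keep order
    return list(permutations(codes, 2))


# Toy claim with made-up labels ("d_" for diagnosis, "h_" for CPT codes).
pairs = skipgram_pairs(["d_1", "d_2", "h_3"])
print(len(pairs))
```

With a whole-claim window, the order of codes within a claim no longer affects which pairs are generated, which matches the treatment of a claim as a set of codes.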

4.3. User study

We recruited four medical experts to use our software and provide feedback. Three were from Fox Chase Cancer Center and one was from Temple University Hospital. All participants had prior experience using software for patient data analysis. We prepared an introductory document and a video to explain the goals and tasks of our software to the users and to demonstrate its functionality. The training process took around 15 minutes.

To evaluate the ability of the participants to use our software for code selection, we asked them to find medical codes that identify patients receiving chemotherapy. In addition, we asked one of the participants to find the codes for excisional biopsy. For each treatment, we asked each participant to spend up to 15 minutes providing a list of CPT and ICD-9 medical codes that indicate the treatment. To evaluate the quality of the code selection, we compared the selection with a gold standard consisting of a list of medical codes manually extracted by clinical researchers during a multi-month effort, which was published in the Appendix of [3].

At the end of each session, we asked the participants to fill out a survey to explain how they felt about their interaction with MedCV. In the survey, participants rated different aspects of MedCV on a scale from 1 to 10, such as effectiveness, flexibility, advantages, and future adoption.

4.4. Results

For the “chemotherapy” treatment, there are 21 codes listed in the gold standard from the Appendix of [3]. Since four of those codes occurred in ten or fewer claims, we removed them from this list, resulting in 17 gold standard codes. Each of the four participants identified between 11 and 16 of these codes. None of the participants selected a code that was not on the list. This is an impressive result, considering that finding the 17 gold standard codes required a multi-month effort by two medical researchers.
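The reported counts can be expressed as precision and recall against the gold standard; these two metrics are our framing of the counts and are not reported in this form above. In the sketch below, the code labels are placeholders and the two selections correspond only to the extremes of the reported range (11 and 16 codes found, none outside the list).

```python
from typing import Set, Tuple


def precision_recall(selected: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a participant's selection against the gold standard."""
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Placeholder labels standing in for the 17 gold standard codes.
gold = {f"gold_{i:02d}" for i in range(17)}
fewest = {f"gold_{i:02d}" for i in range(11)}  # participant with 11 codes
most = {f"gold_{i:02d}" for i in range(16)}    # participant with 16 codes

print(precision_recall(fewest, gold))
print(precision_recall(most, gold))
```

Under these assumptions, precision is 1.0 for every participant, and recall ranges from 11/17 (about 0.65) to 16/17 (about 0.94).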

For the “local excision of breast” treatment, the gold standard included 3 CPT codes. Our participant found all 3 CPT codes and selected two additional codes that were not in the gold standard. After consulting with the authors of [3], we concluded that those two additional codes are indeed related to biopsy or excision of breast (i.e., 19316 “suspension of breast”, 38525 “biopsy/removal lymph nodes”). Thus, we do not consider the selection to be incorrect.

We found from the participant survey that most participants thought the Table view was beneficial. Two of the participants found the Projection view less intuitive. All participants were positive about the effectiveness, advantages, and potential for future adoption of MedCV.

5. CONCLUSIONS

In this paper, we described a visual tool, MedCV, that aims to assist researchers and clinicians in defining the inclusion criteria for patient cohort identification from medical claim data. We demonstrated that visualizing codes, claims, and patient timelines positively impacts users’ reasoning during the cohort identification task. To evaluate MedCV, we recruited four experts and performed a user study. The study revealed that the experts could rapidly find codes identifying two types of treatments. The experts also provided favorable feedback about the software. The results indicate that MedCV could become a valuable tool for retrospective cohort studies using medical claim data.

CCS CONCEPTS.

• Human-centered computing → Visualization toolkits.

ACKNOWLEDGMENTS

This work was funded in part by NIH/NCI grant P30CA006927.

Footnotes

Contributor Information

Ashis Kumar Chanda, Temple University, PA, USA.

Tian Bai, Temple University, PA, USA.

Brian L. Egleston, Fox Chase Cancer Center, PA, USA.

Slobodan Vucetic, Temple University, PA, USA.

REFERENCES

  • [1] Bai Tian, Chanda Ashis Kumar, Egleston Brian L, and Vucetic Slobodan. 2018. EHR phenotyping via jointly embedding medical concepts and words into a unified vector space. BMC medical informatics and decision making 18, 4 (2018), 123.
  • [2] Bai Tian, Egleston Brian L, Bleicher Richard, and Vucetic Slobodan. 2019. Medical concept representation learning from multi-source data. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 4897–4903.
  • [3] Bleicher RJ, Ruth K, Sigurdson ER, Ross E, Wong YN, Patel SA, Boraas M, Topham NS, and Egleston BL. 2012. Preoperative Delays in the US Medicare Population With Breast Cancer. Journal of Clinical Oncology 30, 36 (2012), 4485–4492.
  • [4] Cai Xiangrui, Gao Jinyang, Ngiam Kee Yuan, Ooi Beng Chin, Zhang Ying, and Yuan Xiaojie. 2018. Medical concept embedding with time-aware attention. arXiv preprint arXiv:1806.02873 (2018).
  • [5] Chanda Ashis Kumar, Bai Tian, Yang Ziyu, and Vucetic Slobodan. 2022. Improving medical term embeddings using UMLS Metathesaurus. BMC Medical Informatics and Decision Making 22, 1 (2022), 1–12.
  • [6] Choi Edward, Schuetz Andy, Stewart Walter F, and Sun Jimeng. 2016. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686 (2016).
  • [7] Choi Youngduck, Chiu Chill Yi-I, and Sontag David. 2016. Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings 2016 (2016), 41.
  • [8] Guo Shunan, Jin Zhuochen, Gotz David, Du Fan, Zha Hongyuan, and Cao Nan. 2018. Visual progression analysis of event sequence data. IEEE transactions on visualization and computer graphics 25, 1 (2018), 417–426.
  • [9] Halpern Y, Horng S, Choi Y, and Sontag D. 2016. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association (2016), ocw011.
  • [10] Klabunde CN, Potosky AL, Legler JM, and Warren JL. 2000. Development of a comorbidity index using physician claims data. Journal of clinical epidemiology 53, 12 (2000), 1258–1267.
  • [11] Krause Josua, Perer Adam, and Stavropoulos Harry. 2015. Supporting iterative cohort construction with visual temporal queries. IEEE transactions on visualization and computer graphics 22, 1 (2015), 91–100.
  • [12] Krumholz HM, Wang Y, Mattera JA, Wang Y, Han LF, Ingber MJ, Roman S, and Normand ST. 2006. An administrative claims model suitable for profiling hospital performance based on 30-day mortality rates among patients with heart failure. Circulation 113, 13 (2006), 1693–1701.
  • [13] Lam Heidi, Bertini Enrico, Isenberg Petra, Plaisant Catherine, and Carpendale Sheelagh. 2011. Empirical studies in information visualization: Seven scenarios. IEEE transactions on visualization and computer graphics 18, 9 (2011), 1520–1536.
  • [14] van der Maaten Laurens and Hinton Geoffrey. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • [15] Mikolov Tomas, Chen Kai, Corrado Greg, and Dean Jeffrey. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
  • [16] Miller David C, Saigal Christopher S, Warren Joan L, Leventhal Meryl, Deapen Dennis, Banerjee Mousumi, Lai Julie, Hanley Jan, and Litwin Mark S. 2009. External validation of a claims-based algorithm for classifying kidney-cancer surgeries. BMC health services research 9, 1 (2009), 92.
  • [17] Monroe Megan, Lan Rongjian, Lee Hanseung, Plaisant Catherine, and Shneiderman Ben. 2013. Temporal event sequence simplification. IEEE transactions on visualization and computer graphics 19, 12 (2013), 2227–2236.
  • [18] Munzner Tamara. 2009. A nested model for visualization design and validation. IEEE transactions on visualization and computer graphics 15, 6 (2009), 921–928.
  • [19] Nattinger AB, Laud PW, Bajorunaite R, Sparapani RA, and Freeman JL. 2004. An algorithm for the use of Medicare claims data to identify women with incident breast cancer. Health services research 39, 6p1 (2004), 1733–1750.
  • [20] Nguyen Dang, Luo Wei, Venkatesh Svetha, and Phung Dinh. 2018. Effective identification of similar patients through sequential matching over icd code embedding. Journal of medical systems 42, 5 (2018), 94.
  • [21] Paudel Ramesh, Eberle William, and Talbert Doug. 2017. Detection of anomalous activity in diabetic patients using graph-based approach. In The Thirtieth International Flairs Conference.
  • [22] Pennington Jeffrey, Socher Richard, and Manning Christopher. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
  • [23] Rogers Jen, Spina Nicholas, Neese Ashley, Hess Rachel, Brodke Darrel, and Lex Alexander. 2019. Composer—visual cohort analysis of patient outcomes. Applied clinical informatics 10, 02 (2019), 278–285.
  • [24] Turney Peter D. and Pantel Patrick. 2010. From Frequency to Meaning: Vector Space Models of Semantics. J. Artif. Intell. Res 37 (2010), 141–188. doi: 10.1613/jair.2934
  • [25] Warren JL, Harlan LC, Fahey A, Virnig BA, Freeman JL, Klabunde CN, Cooper GS, and Knopf KB. 2002. Utility of the SEER-Medicare data to identify chemotherapy use. Medical care 40, 8 (2002), IV–55.
  • [26] Warren Joan L, Mariotto Angela, Melbert Danielle, Schrag Deborah, Doria-Rose Paul, Penson David, and Yabroff K Robin. 2016. Sensitivity of Medicare claims to identify cancer recurrence in elderly colorectal and breast cancer patients. Medical care 54, 8 (2016), e47.
  • [27] Winkelmayer WC, Schneeweiss S, Mogun H, Patrick AR, Avorn J, and Solomon DH. 2005. Identification of individuals with CKD from Medicare claims data: a validation study. American Journal of Kidney Diseases 46, 2 (2005), 225–232.
  • [28] Wongsuphasawat Krist, Guerra Gómez John Alexis, Plaisant Catherine, Wang Taowei David, Taieb-Maimon Meirav, and Shneiderman Ben. 2011. LifeFlow: visualizing an overview of event sequences. In Proceedings of the SIGCHI conference on human factors in computing systems. 1747–1756.
  • [29] Yan Y, Birman-Deych E, Radford M, Nilasena D, and Gage B. 2005. Comorbidity indices to predict mortality from medicare data: results from the national registry of atrial fibrillation. Medical care 43, 11 (2005), 1073–1077.
