Abstract
Patient medical records today contain vast amounts of information about patient conditions, together with treatment and procedure records. Systematic analysis of healthcare resource utilization based on such observational data can provide critical insights to guide resource planning and improve the quality of care delivery while reducing cost. Of particular interest to providers are hot spotting, the ability to identify in a timely manner heavy users of the system and their patterns of utilization so that targeted intervention programs can be instituted, and anomaly detection, the ability to identify cases where patients incurred levels of utilization that are unexpected given their clinical characteristics and that may require corrective action. Past work on medical utilization pattern analysis has focused on disease-specific studies. We present a framework for utilization analysis that can be easily applied to any patient population. The framework includes two main components: utilization profiling and hot spotting, where we use a vector space model to represent patient utilization profiles and apply clustering techniques to identify utilization groups within a given population and isolate high utilizers of different types; and contextual anomaly detection for utilization, where models that map patients’ clinical characteristics to utilization levels are built in order to quantify the deviation between expected and actual utilization and identify anomalies. We demonstrate the effectiveness of the framework using claims data collected from a population of 7,667 diabetes patients. Our analysis demonstrates the usefulness of the proposed approaches in identifying clinically meaningful instances for both hot spotting and anomaly detection.
In future work we plan to incorporate additional sources of observational data including EMRs and disease registries, and develop analytics models to leverage temporal relationships among medical encounters to provide more in-depth insights.
1. Introduction
Analysis of patient healthcare resource utilization patterns serves several purposes, including resource planning and allocation and the evaluation of the appropriateness, medical necessity and efficiency of health care services and procedures. Such analysis is of increasing importance for health care institutions seeking to ensure effective and efficient patient care delivery. Patient medical records today include a large number of entries related to patient conditions along with treatments and procedures received. Utilization analysis based on such observational data, collected through the normal course of care delivery and carried out in a systematic manner, can be leveraged to improve care delivery in many ways. Two areas in particular have attracted significant attention recently. The first is the notion of hot spotting, which is the ability to identify in a timely manner patients who are heavy users of the system and their patterns of use, so that targeted, intensive intervention and follow-up programs can be put in place to address their needs and change the existing, potentially ineffective, utilization pattern [9]. The second is anomaly detection, where the goal is to identify utilization patterns that are unusual given patients’ clinical characteristics, including both underutilization and overutilization. Underutilization may indicate a gap in medical service that, if left unaddressed, could result in further deterioration of the patient’s condition, leading to situations requiring more costly and less effective interventions. Overutilization incurs unnecessary cost and wastes precious healthcare resources that could have been directed towards cases in real need. Estimates have put the waste caused by overutilization at more than 30% of total medical cost, and this has been confirmed by real-world medical management experience [1].
This paper describes a new framework for utilization analysis designed to address these needs. The framework includes two main components. The first component is Utilization Profiling and Hot Spotting, where we use a vector space model to represent patient utilization profiles, and apply clustering techniques to identify dominant utilization groups within a given population. Hot spotting can then be performed by analyzing small and isolated high utilization groups. The second component is Contextual Anomaly Detection for Utilization. Typical anomaly detection methods focus on identifying data instances that deviate from the majority of the samples [6]. However, for healthcare utilization anomaly detection, the context provided by the patient’s clinical characteristics is extremely important. A given utilization instance may be perfectly normal for one patient, but unexpected for another patient with different clinical conditions such as comorbidities. We propose novel methods for contextual anomaly detection designed to detect utilization anomalies in such settings. Our method is based on building models trained from observational data to compute the expected utilization levels for each patient given his/her clinical and demographic characteristics, and then examining the difference between the expected and actual levels using well-established statistical methods.
The proposed approaches have been evaluated using out-patient data for a population of 7,667 diabetes patients collected over a one-year period. The main contribution of this paper is the adaptation and integration of advanced machine learning techniques into an important care management application that can be used to perform systematic utilization analysis on any given patient population, to identify clinically meaningful cases of heavy utilization as well as anomalous utilization. Such insights can potentially help care providers achieve better resource allocation and better management of gaps and opportunities in care, leading to improved patient outcomes at reduced cost in the long run. Utilization anomalies could also be indicators of potential medical fraud that require further investigation; however, this aspect is not the focus of this paper.
2. Background
Existing work on medical utilization pattern analysis has focused on disease-specific studies and has not directly proposed a general framework for addressing the issues of hot spotting or anomaly detection. For example, Barsky et al. introduced a clustering method to detect medical care utilization patterns for somatizing patients [2]. Nicholson et al. conducted research on patterns of ambulatory care use for gynecologic conditions [18]. Eisele et al. studied ambulatory medical care utilization patterns before and after the diagnosis of dementia in Germany [7]. Ruchlin et al. examined resident medical care utilization patterns in continuing care retirement communities [20]. van den Bussche et al. analyzed ambulatory medical care utilization by elderly patients in relation to patient conditions in Germany [5]. While these past studies each shed valuable light on the factors affecting the pattern of utilization in a specific disease condition, they were not designed to provide systematic approaches that can be adopted for routine utilization analysis on any given patient population.
Anomaly detection as a general topic has been studied for wide ranging domains including financial fraud detection, industrial damage detection, social media analysis, and medical and public health anomaly detection [6]. Of particular relevance to medical utilization analysis are two main types of anomaly detection, namely Point Anomalies and Contextual Anomalies.
Point anomalies refer to cases where an individual data instance (e.g., number of visits of different types) can be considered anomalous with respect to the rest of the data. This is the simplest form of anomaly and is the focus of the majority of research on anomaly detection in general. Most of the prior work in the medical domain falls into this category [19, 25, 15, 22]. The utilization profiling and hot spotting component of our proposed framework can be considered an instance of this type of anomaly detection using a clustering-based approach (other common approaches include classification-based, statistical, and information-theoretic techniques) [6]. In the clustering-based approach, singleton clusters as well as small clusters are considered point anomalies, based on the operating assumption that normal data instances belong to large, dense clusters, while anomalies belong to small, isolated ones.
Contextual anomalies are more complex cases where a data instance is anomalous in a specific context (e.g., given the particular characteristics for a patient), but not otherwise, hence cannot be detected using methods designed for point anomalies. In order to detect contextual anomalies, each data instance has to be defined using two sets of attributes:
Contextual attributes. The contextual attributes are used to determine the context for that instance, for example, patient characteristics such as comorbidities.
Behavioral attributes. The behavioral attributes are used to determine the non-contextual characteristics of an instance, for example, the number of clinical visits of various types (primary care physician, specialists, emergency room, etc.).
The goal of contextual anomaly detection is to determine whether a particular value for the behavioral attributes is unusual within a specific context. Methods for contextual anomaly detection are particularly valuable in medical utilization analysis as they provide more comprehensive indicators by evaluating the utilization profile of each patient in the context of what is expected for patients with similar characteristics. In doing so they can uncover important anomalies for further investigation that would remain undetected using point anomaly detection methods.
We introduce a new approach to contextual anomaly detection for medical utilization analysis based on expectation modeling with regression models. A similar regression-based approach has been applied before to anomaly detection in time series [6], but to our knowledge has never been applied to medical utilization analysis. The most closely related work in the medical domain is by Hauskrecht and colleagues [12, 11]. They studied methods for conditional anomaly detection to flag potential medical errors by identifying medical actions that are unusual with respect to past patients and their conditions. However, their methods are classification-based and designed to assess the appropriateness of specific medical actions as binary variables. Such methods cannot be easily extended to the study of general multi-dimensional utilization patterns with arbitrary values.
3. Methods
The proposed utilization analysis framework contains two main components. In the first component a multidimensional clustering method is applied to segment a given patient population into groups with similar utilization profiles, as characterized by the numbers of clinical visits of different types. The purpose of the clustering analysis is twofold: 1) to identify the dominant utilization patterns within the population, and 2) to identify groups of high utilizers and understand their general characteristics. In contrast to patient stratification based on total cost or a particular type of utilization (e.g., emergency visits), such multidimensional clustering analysis provides a more comprehensive understanding of the characteristics of different types of utilization.
In the second component we apply the concept of contextual anomaly detection to this domain and develop an expectation-modeling-based approach to identify patients with anomalous utilization records. The basic idea is as follows. First, a regression model is trained from the observational data of recorded patient characteristics and corresponding utilization profiles. This model is then used to calculate the expected utilization level (behavior) with respect to any given patient profile (context). Finally, the observed and expected behaviors are compared, and Grubbs’ test, which is widely used for anomaly detection [10, 6], is applied to determine whether there is an anomaly. In the following we describe each component in detail.
3.1. Utilization Profiling and Hot Spotting
We use pi to indicate the i-th patient. The whole patient population is denoted by 𝒫 = {p1, p2, ⋯, pn}, where n is the number of patients. A patient’s utilization is characterized by the number of visits of each type (e.g., visits to a Primary Care Physician (PCP), visits to a specialist, lab visits, etc.) incurred by the patient during a certain time period. We use vj to denote the j-th visit type, and 𝒱 = {v1, v2, ⋯, vd} to denote the set of visit types, where d is the total number of types. The utilization of patient pi is then represented as a d-dimensional vector xi = [xi1, xi2, ⋯, xid]⊤, where xij is the number of vj visits incurred by patient pi. We call xi the utilization profile of patient pi. For example, a utilization profile of [0,1,3,0,0]⊤ means that there are d = 5 types of visits in total, and that the patient had one visit of type v2 and three visits of type v3.
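As a minimal sketch of this representation, the snippet below counts visits per type to form a utilization profile. The visit-type labels and the function name `utilization_profile` are hypothetical and chosen only for illustration; the paper does not prescribe a data format.

```python
from collections import Counter

# Hypothetical visit-type labels (v1..v5); the ordering fixes the vector layout.
VISIT_TYPES = ["pcp", "specialist", "emergency", "outpatient", "inpatient"]

def utilization_profile(visits, visit_types=VISIT_TYPES):
    """Return [x_i1, ..., x_id]: the number of visits of each type for one patient.

    Visits whose type is not listed in visit_types are ignored.
    """
    counts = Counter(visits)
    return [counts[v] for v in visit_types]

# A patient with one v2 visit and three v3 visits, matching the example in the text.
profile = utilization_profile(["specialist", "emergency", "emergency", "emergency"])
print(profile)  # [0, 1, 3, 0, 0]
```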
Once each patient is represented as a vector in a multi-dimensional space, a variety of clustering algorithms can be applied to segment the patient population into cohorts of patients with similar utilization profiles. To choose the most appropriate algorithm the following factors need to be taken into consideration:
Interpretability. We not only want to identify the patient cohorts, but also want to understand how those patients are grouped together.
Stability. We want the algorithm to be stable, such that cluster assignments do not change much under small perturbations of the parameters and/or the data.
Scalability. As our goal is to provide a general tool for patient utilization analysis, where we may encounter a large patient population, it is important for the approach to be capable of handling large scale data.
In general, data clustering algorithms can be classified into two categories: partitioning methods and hierarchical methods [13]. Partitioning methods formulate clustering as an optimization problem, which makes them performance-driven. The performance measure can be cluster compactness as used in K-means [16], normalized graph cuts [21], or the margins between different clusters [26]. However, these methods are usually not stable. Furthermore, as they focus on algorithm performance rather than interpretability, it is often difficult to interpret the clustering procedure and results. Based on these considerations, we adopted one of the most representative hierarchical methods, Hierarchical Agglomerative Clustering (HAC) [13], which merges the data vectors step by step in a bottom-up manner according to a distance metric. As a by-product, HAC generates an easy-to-explore dendrogram explaining the whole clustering procedure, which makes it convenient for users to control and investigate the clustering results.
One potential problem of HAC is its scalability, as it relies on pairwise data distances. This results in at least O(n²) computational complexity. To alleviate this problem, we borrow an idea from image segmentation and develop a hybrid two-stage HAC algorithm. In image segmentation, a common approach for handling a large number of pixels is to first aggregate them into a set of homogeneous “superpixels”, and then perform clustering on those superpixels [23]. Similarly, in the first stage of our approach, we segment the patient population into a large number of micro clusters; in the second stage, we perform HAC on these micro clusters. The algorithm flow is illustrated in Fig. 1, where each blue dot corresponds to a patient and the outer green circle represents the utilization profile vector space. We use red shaded areas to indicate patient clusters, and red dots to denote cluster representatives (e.g., cluster means).
Many efficient partition-based methods could be used for the over-segmentation stage. We chose the Classification And Regression Tree (CART) [3] algorithm to take advantage of the fact that cost information is typically available in utilization data, and that patients with very similar utilization profiles should also have very similar cost. The vectors representing patients’ utilization profiles are treated as input features and used to predict cost as the target variable. The CART algorithm constructs a rule-based decision tree to segment the patient set by recursively partitioning the feature space until the patients within each partition satisfy a purity constraint (based on cost). The final partitions correspond to the leaf nodes of the tree.
At the end of the first stage, the mean utilization profile of each micro cluster is extracted and treated as the cluster representative. Then in the second stage we cluster these representatives with HAC. As stated above, HAC starts with every representative as its own cluster, and merges them step by step. At each step, the two nearest clusters are merged. Here the distance between two clusters 𝒫i and 𝒫j is measured by
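The over-segmentation stage can be illustrated with a simplified, pure-Python regression-tree partitioner. This is a sketch under stated assumptions, not the CART implementation used in the paper: the purity rule (stop when the standard deviation of cost in a node falls below `max_cost_std`), the parameter names, and the toy data are all hypothetical.

```python
from statistics import mean, pstdev

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(idx, X, cost, min_leaf):
    """Find the axis-aligned split of idx that minimizes the total SSE of cost."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({X[i][j] for i in idx}):
            left = [i for i in idx if X[i][j] <= t]
            right = [i for i in idx if X[i][j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([cost[i] for i in left]) + sse([cost[i] for i in right])
            if best is None or score < best[0]:
                best = (score, left, right)
    return best

def micro_clusters(X, cost, max_cost_std=500.0, min_leaf=1, idx=None):
    """Recursively partition patients until each leaf is 'pure' in cost.

    Assumes min_leaf >= 1. Returns a list of leaves, each a list of patient indices.
    """
    if idx is None:
        idx = list(range(len(X)))
    if len(idx) < 2 * min_leaf or pstdev([cost[i] for i in idx]) <= max_cost_std:
        return [idx]
    split = best_split(idx, X, cost, min_leaf)
    if split is None:
        return [idx]
    _, left, right = split
    return (micro_clusters(X, cost, max_cost_std, min_leaf, left)
            + micro_clusters(X, cost, max_cost_std, min_leaf, right))

def representative(cluster, X):
    """Mean utilization profile of one micro cluster."""
    return [mean(X[i][j] for i in cluster) for j in range(len(X[0]))]

# Toy data: two visit types per patient, with annual cost as the target.
X = [[1, 0], [2, 0], [1, 1], [10, 4], [12, 5], [11, 4]]
cost = [900, 1100, 1000, 20000, 21000, 20500]
clusters = micro_clusters(X, cost)
reps = [representative(c, X) for c in clusters]
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

The representatives in `reps` are what the second stage would cluster; a production system would more likely rely on an established regression-tree library than on this exhaustive split search.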
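$$d(\mathcal{P}_i, \mathcal{P}_j) = \frac{1}{n_i n_j} \sum_{p_k \in \mathcal{P}_i} \sum_{p_l \in \mathcal{P}_j} \lVert \mathbf{x}_k - \mathbf{x}_l \rVert \tag{1}$$
where ni, nj are the sizes of 𝒫i, 𝒫j, and ‖xk − xl‖ is the Euclidean distance between xk and xl, the utilization profiles of patients pk and pl. This is simply the average distance between all pairs of data points with one in 𝒫i and one in 𝒫j.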
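The average pairwise cluster distance and the bottom-up merging loop can be sketched as follows. This is a naive illustration that recomputes all distances at every step; the function names `average_linkage` and `hac` and the toy points are assumptions made for the example, and the dendrogram bookkeeping is omitted.

```python
from itertools import combinations
from math import dist

def average_linkage(cluster_a, cluster_b):
    """Average Euclidean distance over all pairs with one point in each cluster."""
    total = sum(dist(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def hac(points, n_clusters):
    """Merge the two nearest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Pick the pair of cluster indices (a < b) with the smallest distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy cluster representatives in a two-dimensional utilization space.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
result = hac(points, n_clusters=3)
print(result)  # [[(0, 0), (0, 1)], [(10, 10), (10, 11)], [(50, 50)]]
```

In this toy run the isolated point (50, 50) ends up as a singleton cluster, which is the kind of small, isolated group that hot spotting looks for.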
3.2. Contextual Anomaly Detection for Utilization
Our contextual anomaly detection approach consists of the following three steps. First, we learn functions that map clinical characteristics (contextual attributes) to utilization characteristics. These regression models are then used to estimate the expected number of visits of each type. Finally, a statistical test (Grubbs’ test [10]) is applied to check whether a significant difference exists between expected and actual utilization levels.
In the first step, the contextual attributes include patient demographics (age and gender) and clinical features characterized by ICD-9 codes. One potential shortcoming of using ICD-9 codes alone is that they do not adequately reflect clinical relations and risk groupings among different diagnoses. Much past work on Health Risk Assessment (HRA) has deployed various methods of generating risk groups, and reported improved accuracy of healthcare cost prediction using such groupings [24]. We adopted the Hierarchical Condition Categories (HCC) used in Medicare Risk Adjustment provided by CMS (Centers for Medicare and Medicaid Services) in addition to the ICD-9 codes in the contextual attributes. For both ICD-9 codes and HCC codes, the clinical feature is defined as the percent of times that specific diagnosis was given in the utilization analysis period, which provides a measure of dominance of the corresponding condition for a patient. The target variables for the expectation models are the behavioral attributes, which in this case are the numbers of visits for the different utilization types. A separate expectation model is built independently for each utilization type.
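The diagnosis-share feature described above can be sketched as below. The denominator is an assumption: here it is the total number of diagnosis entries recorded for the patient in the analysis period, which is one plausible reading of "percent of times"; the function name and sample codes are illustrative.

```python
from collections import Counter

def diagnosis_shares(diagnosis_codes):
    """Map each code to the percent of the patient's recorded diagnoses it accounts for."""
    counts = Counter(diagnosis_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * n / total for code, n in counts.items()}

# Hypothetical patient with four recorded diagnoses, three of them ICD-9 250.00.
features = diagnosis_shares(["250.00", "250.00", "401.9", "250.00"])
print(features)  # {'250.00': 75.0, '401.9': 25.0}
```

The same function applies unchanged to HCC codes once each diagnosis has been mapped to its category.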
For the regression model, we explored several advanced function learners:
Classification And Regression Trees (CART) [3] is similar to a decision tree except at the leaf level a regression model is constructed in order to map to a continuous target variable, instead of doing a majority vote as in a decision tree classifier.
Random Forest (RF) [4] is an ensemble version of CART where multiple CART trees are built on the bootstrapping samples of the entire patient set. Here the bootstrapping is uniform such that each data point has an equal opportunity to be sampled with replacement.
Multivariate Adaptive Regression Splines (MARS) [8] is a non-parametric regression technique that can be seen as an extension of linear models that automatically models non-linearities and interactions between variables.
We used 10-fold cross-validation to evaluate these different methods. In our experiments, Random Forest consistently outperformed the other methods in all cases, and was thus adopted in our system.
Once the regression models have been trained, they are used to compute the expected level of utilization of each specific type for each patient given the contextual attributes of the patient. The difference between the expected and actual utilization (the residual error) can then be used to determine whether there is an anomaly. Intuitively, an actual utilization level should be declared anomalous if it deviates too much from the expected level. The key question to answer is: how much is too much? Certain utilization types may naturally have a wider range of variability than others and thus should be allowed a larger deviation. To take this inherent variability into consideration, we use Grubbs’ test, which has been widely used in the anomaly detection literature [10, 6].
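The evaluation protocol can be sketched independently of the learner. In the snippet below, `fit` is a hypothetical callback that takes training data and returns a prediction function; `fit_mean` is a trivial baseline included only so the sketch runs on its own, and a Random Forest or any other regressor could be plugged in the same way.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[f::k] for f in range(k)]

def cross_validate(X, y, fit, k=10, seed=0):
    """Return out-of-fold predictions for every sample.

    fit(X_train, y_train) must return a callable predict(x).
    """
    preds = [None] * len(X)
    for fold in kfold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            preds[i] = predict(X[i])
    return preds

def fit_mean(X_train, y_train):
    """Baseline learner: always predicts the mean of the training targets."""
    m = sum(y_train) / len(y_train)
    return lambda x: m

# Each sample is predicted by a model that never saw it during training.
X = [[i] for i in range(20)]
y = [3.0] * 20
preds = cross_validate(X, y, fit_mean)
```

Because every prediction is made out of fold, error measures computed from `preds` estimate how well a learner generalizes to unseen patients.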
The test statistic for the i-th patient on the j-th type of utilization is computed as
$$z_{ij} = \frac{\lvert r_{ij} - \bar{r}_j \rvert}{s_j} \tag{2}$$
where rij = (x̂ij − xij)² is the squared error between the expected value x̂ij and the actual value xij of the j-th type of utilization for patient i, and r̄j and sj are the mean and standard deviation of rij over all patients. Patient i is then declared an anomaly on the j-th type of utilization if
$$z_{ij} > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}} \tag{3}$$
Here n is the number of patients and tα/(2n),n−2 is the threshold used to declare an instance anomalous or normal. This threshold is the critical value of a t-distribution with n − 2 degrees of freedom at a significance level of α/(2n). The significance level measures the confidence associated with the threshold and indirectly controls the number of instances declared anomalous. A patient is declared an anomaly if he/she is identified as an anomaly by Grubbs’ test for at least one type of visit.
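A minimal sketch of this test for a single utilization type is given below, using the standard form of the Grubbs critical value. The t-distribution critical value `t_crit` is passed in by the caller because the Python standard library has no inverse t-distribution; with SciPy installed it could be obtained as `scipy.stats.t.ppf(1 - alpha / (2 * n), n - 2)`. The function names, the toy data, and the value 4.1 (an approximate critical value for n = 8 and α = 0.05) are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def grubbs_threshold(n, t_crit):
    """Critical value of the Grubbs statistic for n samples.

    t_crit: upper alpha/(2n) critical value of a t-distribution with n-2
    degrees of freedom, supplied by the caller.
    """
    return (n - 1) / sqrt(n) * sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

def flag_anomalies(expected, actual, t_crit):
    """Return the indices of patients whose squared error is a Grubbs outlier."""
    r = [(e - a) ** 2 for e, a in zip(expected, actual)]
    r_bar, s = mean(r), stdev(r)
    if s == 0:
        return []  # all residuals identical: nothing stands out
    limit = grubbs_threshold(len(r), t_crit)
    return [i for i, r_i in enumerate(r) if abs(r_i - r_bar) / s > limit]

# Toy example: eight patients, one visit type; the last patient has 30 visits
# where about 4 were expected.
expected = [5, 6, 4, 7, 5, 6, 5, 4]
actual = [5, 7, 4, 6, 5, 6, 4, 30]
flagged = flag_anomalies(expected, actual, t_crit=4.1)
print(flagged)  # [7]
```

Running the function once per utilization type and taking the union of the flagged indices gives the patient-level rule of "anomalous on at least one type of visit".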
4. Results and Analysis: Diabetes Patient Management
4.1. Data description
The proposed framework has been tested using claims data collected from a network of physicians over a one-year period. While the framework is very general and can be applied to any patient population, it is useful to focus on a specific use case to investigate whether the results are clinically meaningful. We thus constrained our experiments to the diabetes patient population, which contains a total of 7,667 patients. For this population, 98% of the total visits belong to one of the top six visit types given in Table 1, which also provides statistics for these top visit categories. The table shows that the majority of patients had a relatively low level of utilization. For example, the 50th percentile of the total number of visits is only 12, i.e., half of the patients made at most 12 visits to medical facilities during the year.
Table 1:
Visit Type | Description | #Visits | Median | Mean | Std | 50th percentile | 80th percentile | 95th percentile |
---|---|---|---|---|---|---|---|---|
1 | PCP visit | 61,253 | 6 | 7.99 | 6.90 | 6 | 12 | 20 |
2 | Specialist visit | 77,255 | 6 | 10.08 | 15.88 | 6 | 15 | 32 |
3 | Emergency visits | 5,731 | 0 | 0.75 | 3.06 | 0 | 0 | 4 |
4 | Outpatient hospital visits | 34,047 | 0 | 4.44 | 13.46 | 0 | 6 | 18 |
5 | Inpatient hospital visits | 20,826 | 0 | 2.72 | 12.68 | 0 | 0 | 14 |
6 | Patient’s home | 15,389 | 0 | 2.00 | 5.15 | 0 | 4 | 9 |
4.2. Results for Utilization Segmentation and Hot Spotting
The modified Hierarchical Agglomerative Clustering approach described in Section 3.1 was applied to this patient population. The resulting dendrogram can then be explored interactively by a domain expert. For each cluster, the user can examine the representative utilization profile of the cluster (computed as the cluster mean), average cost, and patient characteristics such as mean age, sex ratio and dominant diagnoses. For this diabetes population, a close examination by the MD in our group revealed that a total of 10 clusters provides a meaningful level of segmentation.
Figure 2 shows the representative utilization profile for each cluster (cluster mean). Table 2 shows, for each cluster, the cluster size, average cost, average age and a clinical description of the cluster derived by the MD based on information provided by the system as explained above. This analysis shows that clusters 1–4 represent well-managed patients with varying but stable conditions, leading to relatively low levels of utilization and cost. Clusters 5, 6 and 8 represent patients with more advanced disease states and advancing complications, thus requiring increased utilization. Finally, clusters 7, 9 and 10 are the “hot spot” patients with advanced conditions requiring intense utilization of different types. These are patients who will likely benefit from an intensive disease management program.
Table 2:
ID | Size | Ave. Cost/Patient | Ave. Age | Cluster Description |
---|---|---|---|---|
1 | 377 | 5,758 | 69 | Cohort consists primarily of well-managed Diabetics with Hypertension, Hyperlipidemia and cardiac arrhythmias, with cost-effective use of Primary Care, Specialty Care and Outpatient Hospital Clinics, avoiding Hospitalizations and ER Visits |
2 | 959 | 4,720 | 70 | Cohort consists primarily of well-managed Diabetics with Hypertension, Hyperlipidemia and some cardiac disease with cost-effective use of Primary Care, requiring some Specialist visits while avoiding Outpatient Clinics, Hospitalizations and ER Visits. |
3 | 807 | 2,580 | 66 | Cohort of Diabetics with complications of Hypertension, Hyperlipidemia and some with cardiac arrhythmias. They make cost-effective use of Primary Care, while avoiding Specialist, Outpatient Clinics, Hospitalizations and ER Visits. |
4 | 5013 | 1,573 | 63 | Cohort of younger patients with uncomplicated Diabetes with Hypertension, Hyperlipidemia, making minimal use of services. This cohort is a target for interventions with preventative services. |
5 | 239 | 10,150 | 69 | Cohort of Diabetics with Hypertension, Hyperlipidemia, cardiac arrhythmias, and arthritis, making extensive use of Specialists, while avoiding Hospitalizations and ER Visits. |
6 | 127 | 11,738 | 69 | Cohort of more advanced diabetics with increasing comorbidities, and complications requiring periodic hospitalization for exacerbations of heart failure, pulmonary disease and renal failure. |
7 | 3 | 66,480 | 63 | Cohort of high-utilizing diabetics with advancing renal failure, cardiac and peripheral vascular disease. Leukemia is a significant coexisting comorbidity. These patients require frequent expensive hospital outpatient procedures and use of specialists. |
8 | 112 | 16,375 | 66 | Cohort of diabetics with advancing complications thereof, being managed by specialists. Higher representation of women. Outpatient hospital visits likely include treatment of peripheral vascular disease, vascular ulcers and bone infections. |
9 | 14 | 42,559 | 71 | Cohort of older diabetics likely to have a high percentage of smokers with costly complications including malignancies, COPD and complex outpatient treatment thereof. |
10 | 16 | 41,980 | 57 | Cohort of end-stage diabetics with advanced complications including renal failure, heart failure, poorly controlled hypertension, requiring frequent hospitalizations, outpatient hospital procedures, and home health visits. |
4.3. Results for Contextual Anomaly Detection
As described in Section 3.2, a separate expectation model was trained for each of the top six utilization types using a Random Forest regression model, with diagnoses, age and sex as the contextual attributes used to predict the expected level of utilization. Table 3 shows the prediction results for each of the six utilization types using standard 10-fold cross-validation. As can be seen in the table, a positive R² was achieved for all utilization types, including even Emergency visits, which are particularly difficult to predict because of the sparsity of the events and their large degree of randomness (e.g., accidents). For visit types involving a lesser degree of randomness, performance improves as expected. In particular, for Specialist and Inpatient hospital visits we achieved R² values of 0.30 and 0.34, respectively. These results indicate that the proposed expectation model can indeed predict the expected utilization level better than the population mean, which should lead to more personalized and clinically meaningful anomaly detection.
Table 3:
Metric | PCP | Specialist | Emergency | Inpatient Hospital | Outpatient Hospital | Patients’ Home |
---|---|---|---|---|---|---|
RMSE | 6.11 | 13.26 | 2.93 | 10.30 | 12.15 | 4.81 |
R² | 0.22 | 0.30 | 0.085 | 0.34 | 0.18 | 0.13 |
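For readers unfamiliar with the two measures in Table 3, the following sketch shows how they are conventionally computed and why a positive R² means the model beats the population mean. The function names and toy data are illustrative; the paper does not state which implementation was used.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """1 - SSE/SST; 0 means no better than always predicting the mean."""
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - sse / sst

actual = [2, 4, 6, 8]
baseline = [5, 5, 5, 5]          # always predict the population mean
print(r_squared(actual, baseline))  # 0.0
print(r_squared(actual, actual))    # 1.0
```

As a rough consistency check, R² is approximately 1 − (RMSE/std)²; with the PCP figures from Tables 1 and 3 this gives 1 − (6.11/6.90)² ≈ 0.22, in line with the reported value.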
For each patient, the difference between the expected and actual levels of utilization is compared against the mean residual error, and Grubbs’ test is used to determine whether this difference is anomalous. A patient is considered anomalous if he/she is flagged as such for at least one of the utilization types. Using a significance level of 0.05 in Grubbs’ test, a total of 51 anomalies were detected. These anomalies can then be explored in the system by examining the actual vs. expected utilization levels and contextual attributes including age, sex and dominant diagnosis, to determine the next step of investigation. Here we provide sample investigations of three of the patients with anomalous utilization. Figure 3 shows the expected vs. actual utilization for each patient, and Table 4 provides the characteristics of these patients along with investigation notes and recommendations from the MD on our team who examined the cases.
Table 4:
ID | Age | Top Comorbidities | Anomaly Analysis |
---|---|---|---|
6071 | 48 | other symptoms involving abdomen and pelvis; heart failure; essential hypertension; obesity and other hyperalimentation; nondependent abuse of drugs; ill-defined descriptions and complications of heart disease; acute bronchitis and bronchiolitis | Diabetic with advancing complications including heart failure, thus requiring large number of specialist visits. Patient is underutilizing PCP and Specialists, and appears non-compliant with poor dietary control and is likely a smoker. This person over-utilizes the ER, possibly for both diabetic and respiratory complications, as well as drug-seeking behavior. |
1311 | 67 | acquired hypothyroidism; disorders of lipoid metabolism; asthma; nontoxic nodular goiter; essential hypertension; other disorders of urethra and urinary tract; other and unspecified anemias | Older diabetic, with prior thyroidectomy and resultant hypothyroidism, plus possible urinary stress incontinence, as well as poorly controlled asthma. The latter may be due to suboptimal medication regimen and/or non-compliance, resulting in unexpectedly frequent PCP and Specialist visits, and hospital admissions from the doctor’s office. Home health visits are likely for home respiratory therapy. Patient likely has been trained to contact her doctors instead of using the ER. |
2815 | 59 | other disorders of cervical region; symptoms involving head and neck; disorders of lipoid metabolism; other disorders of soft tissues; intervertebral disc disorders; other forms of chronic ischemic heart disease; gastrointestinal hemorrhage; cellulitis and abscess of finger and toe | Likely a non-compliant diabetic with hyperlipidemia, cardiac disease and vascular disease leading to skin ulcers and infections. Cervical disc disease is a comorbidity. This patient over-utilizes specialists and the ER, most likely due to diabetic complications, and chronic pain related to cervical disc disease. Unexpected hospitalizations are likely due to complications related to non-compliance, and outpatient hospital visits may be related to antibiotic treatment of skin infections secondary to vascular disease and poor self-care. Alcohol abuse should also be investigated in light of the history of gastrointestinal hemorrhage. |
5. Conclusions
We present a novel framework for utilization analysis that can be used to perform systematic and timely identification of heavy users of different types as well as contextual anomalies, i.e., utilization instances that are unexpected given patients’ clinical characteristics. In order to assess the general applicability of the framework, in this initial exploration we restricted our experiments and analysis to the most widely available type of data, i.e., claims data including diagnoses, demographics, and medical utilization records. Our evaluations and case studies demonstrate the usefulness of the proposed approaches in identifying clinically meaningful instances for both hot spotting and anomaly detection, using the most basic observational data as described above. Many other data sources, such as EMRs and patient and disease registries, could provide additional information relevant to utilization analysis. In future work we plan to expand our framework to leverage these additional data sources to provide enhanced performance and additional actionable insight. Another limitation of the proposed methods is that we currently do not consider temporal relationships among different medical events or encounters. Exploring such relationships could provide deeper context and more fine-grained utilization patterns as well as contextual attributes. Temporal event analysis in the medical domain has been widely studied in the literature for discovery of medical knowledge and decision support [17, 14]. Incorporating and extending these methodologies for medical utilization analysis is another important direction of our future work.
References
- [1]. http://leanmedicalcare.org/?p=85
- [2]. Barsky AJ, Orav EJ, Bates DW. Distinctive patterns of medical care utilization in patients who somatize. Med Care. 2006;44(9):803–811. doi: 10.1097/01.mlr.0000228028.07069.59.
- [3]. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. CRC Press; Boca Raton: 1984.
- [4]. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
- [5]. van den Bussche H, Schön G, Kolonko T, Hansen H, Wegscheider K, Glaeske G, Koller D. Patterns of ambulatory medical care utilization in elderly patients with special reference to chronic diseases and multimorbidity - results from a claims data based observational study in Germany. BMC Geriatrics. 2011;11(1). doi: 10.1186/1471-2318-11-54.
- [6]. Chandola V, Banerjee A, Kumar V. Anomaly detection: A survey. Technical Report TR 07-017, Dept of Computer Engineering, Univ Minnesota. 2007.
- [7]. Eisele M, van den Bussche H, Koller D, Wiese B, Kaduszkiewicz H, Maier W, Glaeske G, Steinmann S, Wegscheider K, Schön G. Utilization patterns of ambulatory medical care before and after the diagnosis of dementia in Germany–results of a case-control study. Dement. Geriatr. Cogn. Disord. 2010 Jul;29(6):475–483. doi: 10.1159/000310350.
- [8]. Friedman JH. Multivariate adaptive regression splines. Annals of Statistics. 1991;19(1):1–67.
- [9]. Gawande A. The hot spotters. New Yorker. 2011 Jan.
- [10]. Grubbs F. Procedures for detecting outlying observations in samples. Technometrics. 1969;11(1):1–21.
- [11]. Hauskrecht M, Valko M, Batal I, Clermont G, Visweswaran S, Cooper G. Conditional outlier detection for clinical alerting. Proceedings of the Annual American Medical Informatics Association (AMIA) Symposium. 2010; pp. 286–290.
- [12]. Hauskrecht M, Valko M, Kveton B, Visweswaran S, Cooper G. Evidence-based anomaly detection in clinical domains. Proceedings of the Annual American Medical Informatics Association (AMIA) Symposium. 2007; pp. 319–323.
- [13]. Jain AK, Dubes RC. Algorithms for Clustering Data. Prentice-Hall; Upper Saddle River, NJ, USA: 1988.
- [14]. Lee N, Laine A, Hu J, Wang F, Sun J, Ebadollahi S. Mining electronic medical records to explore the linkage between healthcare resource utilization and disease severity in diabetic patients. First IEEE International Conf. on Health Informatics, Imaging and Systems Biology; 2011.
- [15]. Lin J, Keogh E, Fu A, Van Herle H. Approximations to magic: Finding unusual medical time series. 18th IEEE Symp. on Computer-Based Medical Systems (CBMS); 2005. pp. 23–24.
- [16]. MacQueen JB. Some methods for classification and analysis of multivariate observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; 1967. pp. 281–297.
- [17]. Moskovitch MR, Shahar Y. Medical temporal knowledge discovery via temporal abstraction. AMIA Annual Symposium Proceedings; 2009. pp. 452–456.
- [18]. Nicholson WK, Ellison SA, Grason H, Powe NR. Patterns of ambulatory care use for gynecologic conditions: A national study. Am. J. Obstet. Gynecol. 2001 Mar;184(4):523–530. doi: 10.1067/mob.2001.111795.
- [19]. Penny KI, Jolliffe IT. A comparison of multivariate outlier detection methods for clinical laboratory safety data. Journal of the Royal Statistical Society: Series D (The Statistician). 2001 Sep;50(3):295–307.
- [20]. Ruchlin HS, Morris S, Morris JN. Resident medical care utilization patterns in continuing care retirement communities. Health Care Financ Rev. 1993 Summer;14(4):151–168.
- [21]. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000; pp. 888–905.
- [22]. Syed Z, Saeed M, Rubinfeld I. Identifying high-risk patients without labeled training data: Anomaly detection methodologies to predict adverse outcomes. Proceedings of the Annual American Medical Informatics Association (AMIA) Symposium. 2010; pp. 772–776.
- [23]. Tighe J, Lazebnik S. Superparsing: Scalable nonparametric image parsing with superpixels. Proceedings of the 2010 European Conf on Computer Vision (ECCV) 2007; pp. 319–323.
- [24]. Winkelman R, Mehmod S. A comparative analysis of claims-based tools for health risk assessment. Society of Actuaries Report. 2007.
- [25]. Wong WK, Moor A, Cooper G, Wagner M. Bayesian network anomaly pattern detection for disease outbreaks. Proceedings of the 20th International Conference on Machine Learning. 2003; pp. 808–815.
- [26]. Xu L, Neufeld J, Larson B, Schuurmans D. Maximum margin clustering. Advances in Neural Information Processing Systems. 2004;17.