AMIA Annual Symposium Proceedings. 2012 Nov 3;2012:370–379.

Anomaly detection in clinical processes

Zhengxing Huang 1, Xudong Lu 1, Huilong Duan 1
PMCID: PMC3540475  PMID: 23304307

Abstract

Meaningful anomalies in clinical processes may be related to care performance or even patient survival. It is imperative that anomalies be detected in a timely manner so that useful and actionable knowledge can be extracted for clinicians. Many previous approaches assume prior knowledge about the structure of clinical processes, using which anomalies are detected in a supervised manner. For a majority of clinical settings, however, clinical processes are complex, ad hoc, and even unknown a priori. In this paper, we investigate how to facilitate detection of anomalies in an unsupervised manner. An anomaly detection model is presented by applying a density-based clustering method on patient careflow logs. Using the learned model, it is possible to detect whether a particular patient careflow trace is anomalous with respect to normal traces in the logs. The approach has been validated over real data sets collected from a Chinese hospital.

Introduction

Clinical processes, as a unique type of patient-linked process (i.e., diagnostic and therapeutic procedures to be carried out for a specific patient), are becoming an important issue in the healthcare domain[1,2,3,4]. Unlike traditional business processes, e.g., manufacturing processes, which are specific sequences of steps with few variations and are used to produce particular parts of products, clinical processes are executed according to a diagnostic-therapeutic cycle comprising observation, reasoning and action[1,4]. The diagnostic-therapeutic cycle depends heavily on medical knowledge to deal with case-specific decisions that are made by interpreting patient-specific information[4]. During clinical process execution, the patient state may change dynamically, which in turn requires the process itself to be adaptive as well[3,5,6,7].

In clinical practice, patient states are uncertain and are affected by many factors. These uncertainties can result from inter-observer variability, inaccurate evaluation of the patient and deficiencies in grading scales. Although many studies have been proposed to support modeling flexibility and instance adaptation of clinical processes[3,4,5,6,8], the patient careflow may not proceed in the expected direction and may differ, even if only slightly, from the predefined pathways or procedures. In such cases, anomalies inevitably occur in clinical processes.

From the clinical process perspective, anomalies are either one or a set of unexpected treatment activities (e.g., an unexpected antibiotic treatment performed on a specific patient for an infection), or predefined treatment activities performed at unexpected time-stamps in the patient careflow (e.g., the postponement of a predefined surgery).

There is good reason to expect that anomalies in clinical processes are important. As a matter of fact, medical knowledge and techniques are continuously evolving, meaning that new treatment and diagnostic procedures are constantly being discovered that may invalidate current treatment pathways or require adaptations[3,4]. In addition, the medical decision process, i.e., the procedure of interpreting patient-specific data according to medical knowledge and the individual experience of physicians, is complex and even unpredictable[5]. As physicians have the power to act according to their knowledge and experience, and need to deviate from defined clinical process patterns to deal with specific patient situations, the result is that there are clinical process traces with a high degree of anomalies[4]. In clinical environments where flexibility is important, anomalies in clinical processes may be related to patient treatment performance or even patient survival. This may be especially true in emergency care units or intensive care units, where mindful, flexible treatment procedures are preferable.

Anomalies in clinical processes are usually conceptualized in terms of process outcomes, and a well-defined body of knowledge has developed that takes this perspective[2]. In clinical practice, for example, length of stay (LOS), mortality, and infection rate are commonly used measures of clinical process anomalies[9]. From the outcome perspective, clinical process anomalies are conceptualized as reliability (the standard deviation of some output parameter) and accuracy (changes in the mean value over time relative to some target)[9]. Accuracy and reliability of outputs are enormously important because they directly affect the performance of clinical processes. As valuable as these measures are, they restrict attention to the analysis of clinical process anomalies rather than their detection. From the operational perspective, timely identification of anomalies in the patient-care journey is indeed desirable. Meaningful anomalies may be related to care performance or even patient survival. If these anomalies can be detected in time, the patient careflow can be made more adaptive and efficient. As a matter of fact, anomaly detection in clinical processes is of increasing interest to medical staff and hospital managers, who prefer mindful and timely adjustment of patient care and treatment when anomalies are detected.

Although many excellent studies have been proposed for anomaly detection in clinical processes[10,11], they assume prior knowledge about clinical process models (e.g., clinical protocols, clinical pathways, etc.)[4,5], using which anomalies are detected in a completely supervised manner. As mentioned above, for a majority of clinical settings, clinical processes are highly dynamic, complex, and ad hoc, which makes the patient-care journey generally flexible and even unknown a priori. In addition, most of the previous approaches are based on the experience and knowledge of clinical experts[3,4,12]. In particular, analysts interpret large amounts of collected data and identify anomalies in patient clinical process traces piece by piece, which can be very tedious. Furthermore, analysis results appear to be influenced by perceptions, e.g., descriptions of medical behaviors in clinical processes are often normative in the sense that they state what should be done rather than describing the actual medical behaviors. As a result, such analysis tends to be rather subjective. The challenge, therefore, is how to detect anomalies automatically and objectively without prior knowledge of clinical process models.

Our approach to this challenge is based on the assumption that we can detect anomalies by mining careflow logs, which regularly record normal and abnormal medical behaviors in clinical processes. At the heart of this assumption is the question of whether we can have appropriately descriptive yet robustly detectable careflow logs that record patient treatment and care behaviors in a variety of clinical settings. Since many hospital information systems regularly record a wide range of valuable data, such as which clinical activities are performed, who performed each activity, and when, these data can be organized so that they contain a history of what occurred during clinical process execution, in a manner that facilitates making useful higher-level inferences. The idea of detecting anomalies in clinical processes by mining careflow logs is essential to move away from the traditional subjective approaches to clinical process analysis and to adopt a more objective perspective.

To this end, we present a novel approach to anomaly detection from the collected careflow logs. The proposed approach consists of three steps: (1) patient trace representation; (2) patient trace clustering on the logs; and (3) detection of patient traces that deviate from the characteristics of normal traces of the discovered clusters. A detailed description of the proposed approach follows.

Patient trace representation

A patient trace of clinical processes, as a particular instance of patient careflow in the hospital, consists of different categories of medical behaviors, namely multidisciplinary clinical activities, and their dependencies[6]. In practice, many hospital information systems typically record all kinds of information of patient traces in careflow logs. Unfortunately, most systems use their own specific format of the logs. We, therefore, formalize the following concepts.

Definition 1 (Clinical event)

Let A be a set of clinical activities, and T the time domain. Let 𝒠 be the clinical event universe, i.e., the set of all possible event identifiers. We assume that clinical events are characterized by various properties, e.g., a clinical event has a time-stamp, corresponds to a clinical activity type, is executed by a particular member of the medical staff, has an associated patient profile, etc. We do not impose a specific set of properties; however, given the focus of this paper, we assume that two of these properties are the activity type and time-stamp of a clinical event, i.e., there are functions πa ∈ 𝒠 → A and πt ∈ 𝒠 → T assigning clinical activity types and time-stamps to clinical events, respectively. For convenience, we simply represent a clinical event e as e = (a, t), where a ∈ A is the activity type of e, and t ∈ T is the time-stamp of e.

For example, let e = (admission, 1) be a particular clinical event, where πa(e) = admission is the activity type of the event e, and πt(e) = 1 is the occurrence time of the event e.

Definition 2 (Patient trace)

A patient trace is represented as a sequence of clinical events, i.e., σ ∈ 𝒠* such that each clinical event appears only once and time is non-decreasing, i.e., for 1 ≤ i < j ≤ |σ| : σ(i) ≠ σ(j) and πt(σ(i)) ≤ πt(σ(j)). In the sequence, if events occur at the same time, they are ordered alphabetically.

Table 1 depicts a simple example of patient traces commonly found in the bronchial lung cancer careflow, using letters of the alphabet as activity labels. The meaning of these alphabetic labels is described in Table 2. In Table 1, there are three patient traces, represented as σ1, σ2 and σ3, respectively. For the particular trace σ1, the first clinical event is σ1(1) = (a, 1), and the last clinical event is σ1(15) = (v, 16).

Table 1:

A careflow log example of the bronchial lung cancer treatment process.

id sequence
σ1 〈(a, 1), (b, 1), (c, 1), (d, 1), (j, 1), (f, 2), (q, 2), (g, 4), (h, 4), (i, 4), (s, 7), (r, 8), (o, 13), (t, 15), (v, 16)〉
σ2 〈(a, 1), (b, 1), (c, 1), (d, 1), (e, 1), (f, 9), (g, 10), (h, 10), (i, 10), (j, 11), (s, 14), (r, 18), (k, 22), (t, 22), (l, 24), (v, 24)〉
σ3 〈(a, 1), (b, 1), (c, 1), (d, 1), (e, 1), (m, 2), (n, 2), (j, 3), (g, 4), (h, 4), (i, 4), (s, 7), (r, 13), (o, 17), (p, 17), (u, 19), (k, 21), (l, 21), (t, 21), (v, 21)〉

Table 2:

The set of clinical activities and their alphabetic labels.

Label | Name | Label | Name
a | Admission | l | B-ultrasound examination
b | Color ultrasound examination after admission | m | Cleansing enema
c | ECG | n | Bronchoscopic treatment
d | Pulmonary function tests | o | Configure anti-tumor chemotherapy
e | Cardiac color Doppler ultrasound | p | Infrared Therapy/irradiated start
f | Venous catheterization | q | Determination of left ventricular function
g | Radical surgery | r | Post operative drain end
h | Post operative drain start | s | Urinary retention guide end
i | Urinary retention guide start | t | Color ultrasound examination before discharge
j | Inhalation | u | Infrared Therapy/irradiated end
k | Thoracentesis | v | Discharge

Definition 3 (Careflow log)

Let 𝒞 be the set of all possible patient traces (including partial traces). A care-flow log ℒ is a set of patient traces, i.e., ℒ ⊆ 𝒞.

For example, Table 1 shows an example of a care-flow log about the bronchial lung cancer treatment process. Note that a patient trace in a careflow log represents a particular clinical process instance also referred to as “case” of the treatment process of a patient.
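Definitions 1 to 3 map naturally onto simple data structures. The following is a minimal sketch, assuming events are stored as (activity, time-stamp) pairs with time-stamps in days; the names `ClinicalEvent`, `make_trace` and `is_valid_trace` are illustrative and do not come from the paper. It encodes σ1 from Table 1 and checks the two constraints of Definition 2 (no repeated event, non-decreasing time).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ClinicalEvent:
    activity: str  # pi_a(e): the activity type
    time: int      # pi_t(e): the time-stamp (assumed here to be a day number)


def make_trace(pairs: List[Tuple[str, int]]) -> List[ClinicalEvent]:
    """Build a patient trace from (activity, time) pairs."""
    return [ClinicalEvent(a, t) for a, t in pairs]


def is_valid_trace(trace: List[ClinicalEvent]) -> bool:
    """Check Definition 2: each event appears once and time never decreases."""
    if len(set(trace)) != len(trace):
        return False
    return all(x.time <= y.time for x, y in zip(trace, trace[1:]))


# Trace sigma_1 from Table 1.
sigma1 = make_trace([
    ("a", 1), ("b", 1), ("c", 1), ("d", 1), ("j", 1), ("f", 2), ("q", 2),
    ("g", 4), ("h", 4), ("i", 4), ("s", 7), ("r", 8), ("o", 13), ("t", 15),
    ("v", 16),
])

# A careflow log (Definition 3) is then just a collection of traces.
careflow_log = [sigma1]
```

Under this representation `sigma1[0]` corresponds to σ1(1) and `sigma1[-1]` to σ1(15), since Python indexes from zero.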

Patient trace clustering

In this study, a careflow log is mined to classify clinical processes. In clinical processes, patients who have similar symptoms, chief complaints, pathology examination results, and other clinical features may receive similar medical behaviors. Patients with similar medical behaviors in their care journeys can be grouped into the same cluster. Thus, one can classify the medical behaviors of a particular clinical process into several categories based on patient trace clustering.

Definition 4 (Patient trace cluster)

Let ℒ be a careflow log and dist(σ, ρ) be the distance measure for any two traces σ and ρ in ℒ. ℒ can be partitioned into multiple clusters ϕ1, …, ϕn such that ∑_{i≠j} dist(σi, ρj) is maximized and ∑_{i=j} dist(σi, ρj) is minimized, where σi ∈ ϕi and ρj ∈ ϕj, respectively.

This is the classic clustering problem of maximizing the inter-cluster distance and minimizing the intra-cluster distance[13]. Given a reasonable distance measure dist(σ, ρ), grouping similar patient traces via clustering can identify all major clusters in a careflow log. In the rest of this section, we present the details of the distance measure.

A major challenge in clustering patient traces is the diversity of a careflow log: the chance that traces are structured very differently grows with the number of traces being analyzed. One possible solution to this problem is to reduce the number of traces that are analyzed at a time. Note that information about the structure or content of a particular clinical process is usually not explicitly known, i.e., there is no available knowledge on how to partition the set of patient traces. One can, however, measure the similarity of traces and use this information to divide the set of traces into more homogeneous subsets.

In order to cluster similar patient traces, the similarity between pairwise traces is measured by computing the relative distance between the traces. Similarity measurement is the process of comparing pairwise traces to determine their degree of matching. For numeric values, the most commonly used similarity measure between pairwise traces is derived from the distance between the clinical events of the two traces. Traditional sequence similarity measures focus on direct matching between sequences, commonly applying classical distance concepts, e.g., Euclidean or Minkowski distances[14]. In[15], an ‘edit distance’ idea is proposed, which is based on the assumption that the distance between two temporal sequences reflects the amount of work needed to transform one sequence into the other. This approach has been widely used to measure distance in the analysis of textual strings and biological sequences, and has already been introduced into temporal sequence similarity measurement[13]. However, traditional approaches ignore the time-stamp information of clinical events in patient traces. As a matter of fact, the time-stamps of clinical events significantly affect whether a patient careflow is anomalous. Thus, we argue that the similarity between pairwise traces σ and ρ should be defined based on two considerations. First, the higher the number of matched events between σ and ρ, the higher the similarity. Second, the similarity between the time-stamps of clinical events should be taken into account. With regard to these considerations, we present our method to measure the similarity between pairwise patient traces.

Formally, let σ(i) and ρ(j) be two specific clinical events on the ith and jth positions of pairwise patient traces σ and ρ, respectively. The temporal similarity between σ(i) and ρ(j) is defined as

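$$\omega(\sigma(i),\rho(j)) = \begin{cases} \left| \dfrac{(\sigma(i)-\sigma(1)) - (\rho(j)-\rho(1))}{\max\big((\sigma(i)-\sigma(1)),\ (\rho(j)-\rho(1))\big)} \right| & \text{if } \pi_a(\sigma(i)) = \pi_a(\rho(j)) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$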

where the minus operation refers to the difference in time between clinical events, e.g., σ(i) − σ(1) is the time difference between clinical events σ(i) and σ(1).
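As a small illustration of Equation (1), the sketch below computes ω for events taken from σ1 and σ2 in Table 1. It assumes traces are lists of (activity, time) pairs, uses 0-based positions, and returns 0 when both events coincide with the start of their traces (the equation itself leaves that 0/0 case undefined, so this is our assumption). Note that, as defined, ω is 0 when the two time offsets coincide and grows as they diverge, so it acts as a penalty when it is used in Algorithm 1.

```python
from typing import List, Tuple

Trace = List[Tuple[str, int]]  # (activity type, time-stamp in days)


def omega(sigma: Trace, i: int, rho: Trace, j: int) -> float:
    """Equation (1) for the events at 0-based positions i and j."""
    (act_s, t_s), (act_r, t_r) = sigma[i], rho[j]
    if act_s != act_r:
        return 0.0
    d_sigma = t_s - sigma[0][1]  # offset from the first event of sigma
    d_rho = t_r - rho[0][1]      # offset from the first event of rho
    longest = max(d_sigma, d_rho)
    if longest == 0:
        return 0.0  # assumption: both events at trace start, no temporal gap
    return abs((d_sigma - d_rho) / longest)


# Excerpts of sigma_1 and sigma_2 from Table 1 (admission, catheterization, surgery).
sigma1 = [("a", 1), ("f", 2), ("g", 4)]
sigma2 = [("a", 1), ("f", 9), ("g", 10)]

print(omega(sigma1, 1, sigma2, 1))  # offsets 1 and 8 days -> 0.875
print(omega(sigma1, 2, sigma2, 2))  # offsets 3 and 9 days -> about 0.667
```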

Furthermore, the similarity between pairwise traces σ and ρ, denoted as sim(σ, ρ), can be calculated using the formula:

$$sim(\sigma,\rho) = 1 - \frac{\sum_{i=1}^{\max(|\sigma|,|\rho|)} \Delta_i}{\max(|\sigma|,|\rho|)} \tag{2}$$

where σ and ρ are pairwise patient traces, and Δi is the penalty of changing πa(σ(i)) in σ to πa(ρ(i)) in ρ. The similarity measure algorithm for pairwise patient traces, described in Algorithm 1, evaluates the minimal total penalty to transform σ into ρ using dynamic programming. During the transformation, four basic edit operations can be selected: “no change”, “substitution”, “deletion” and “insertion”. The penalty of the edit operation that changes πa(σ(i)) in σ to πa(ρ(i)) in ρ is defined as:

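$$\Delta_i = \begin{cases} \omega(\psi_{\sigma_i}, \psi_{\rho_i}) & \text{if no change, i.e., } \pi_a(\sigma(i)) = \pi_a(\rho(i)) \\ 1 & \text{if substitution, insertion, or deletion} \end{cases} \tag{3}$$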

The inputs of Algorithm 1 are pairwise patient traces σ and ρ, while the output is the similarity value between σ and ρ. The algorithm measures the minimal penalty of transforming σ to ρ, which starts from the top left of a minimal penalty matrix Matrix, traces through the matrix step by step, and stops at the bottom right of the matrix. As a result, the similarity between σ and ρ is calculated.

The objective of clustering methods that work on the presented similarity measure is to minimize the intra-cluster distance and maximize the inter-cluster distance. To this end, we adopted the density-based KNN clustering method[13] to generate partitions of the patient traces in careflow logs. Generally speaking, a patient trace is “dense” if there are many traces similar to it in a careflow log, and “sparse” if it is not similar to any others. Formally speaking, we measure the density of a patient trace as the quotient of the number of nearest neighbors (i.e., similar traces) and the space occupied by those similar traces (i.e., the distance between similar traces). Given a particular k, which specifies the k-nearest neighbor region, we define the density of a particular patient trace σ in ℒ as follows: let sim1, sim2, …, simk be the k largest values of sim(σ, ρ); then,

$$Density(\sigma,k) = \frac{n_g(\sigma)}{1 - sim^*(\sigma)} \tag{4}$$

where sim*(σ) = min{sim1, sim2, …, simk}, and ng(σ) = |{ρ | ρ ∈ ℒ ∧ sim(σ, ρ) ≥ sim*(σ)}|. The presented clustering algorithm, shown in Algorithm 2, is based on a uniform kernel density-based KNN clustering method[13]. In this algorithm, each patient trace links to its closest neighbors. Steps 8–11 initialize each patient trace as a cluster. Steps 12–23 build the single linkage clusters based on the density of traces. Then, in Steps 24–31, the single linkage clusters satisfying the constraint in Step 26 are linked, merging any local maximal regions. The density of a cluster is the maximum density over all patient trace densities in the cluster.
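To make Equation (4) concrete, here is a minimal sketch that computes the density of one trace from its similarities to the other traces in the log. Two points are our assumptions rather than statements from the paper: the trace itself is excluded from its own neighborhood, and the density is taken as infinite when sim*(σ) = 1 (which would otherwise divide by zero). The function name `density` is illustrative.

```python
from typing import List


def density(sims_to_others: List[float], k: int) -> float:
    """Equation (4): n_g(sigma) / (1 - sim*(sigma)).

    sims_to_others holds sim(sigma, rho) for every other trace rho in the log.
    """
    if not sims_to_others:
        return 0.0  # assumption: an isolated trace has zero density
    top_k = sorted(sims_to_others, reverse=True)[:k]
    sim_star = min(top_k)  # similarity of the k-th nearest neighbor
    n_g = sum(1 for s in sims_to_others if s >= sim_star)
    gap = 1.0 - sim_star   # "space" occupied by the neighborhood
    return float("inf") if gap == 0 else n_g / gap


# With k = 2 the neighborhood threshold is 0.75; three traces reach it
# (ties are counted), so the density is 3 / (1 - 0.75) = 12.
print(density([0.9, 0.75, 0.75, 0.5], k=2))
```

The tie at 0.75 shows why ng(σ) can exceed k: every trace at least as similar as the k-th nearest neighbor is counted.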

Algorithm 1.

Similarity measure between pairwise patient traces.

Procedure::Similarity_Measure(σ, ρ)
Input:
  σ and ρ are a pair of patient traces
Output:
  sim is the similarity value between σ and ρ
Steps:
  If σ is empty, i.e., |σ| = 0
    Let sim = 0
  Else if ρ is empty, i.e., |ρ| = 0
    Let sim = 0
  Else
    Let Matrix be a (|σ| + 1) × (|ρ| + 1) minimal penalty matrix with Matrix[0][0] = 0
    For i = 1 to |σ|
      Let Matrix[i][0] = i
    End For
    For j = 1 to |ρ|
      Let Matrix[0][j] = j
    End For
    For i = 1 to |σ|
      For j = 1 to |ρ|
        Let x1 = Matrix[i − 1][j] + 1   (deletion)
        Let x2 = Matrix[i][j − 1] + 1   (insertion)
        If πa(σ(i)) = πa(ρ(j))
          Let x3 = Matrix[i − 1][j − 1] + ω(σ(i), ρ(j))   (no change)
        Else
          Let x3 = Matrix[i − 1][j − 1] + 1   (substitution)
        End If
        Let Matrix[i][j] = min(x1, x2, x3)
      End For
    End For
    Let sim = 1 − Matrix[|σ|][|ρ|] / max(|σ|, |ρ|)
  End If
  Return sim
End Procedure
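Algorithm 1 can be sketched in Python as follows. This is one possible reading of the pseudocode, not the authors' implementation: traces are assumed to be lists of (activity, time) pairs, deletion and insertion are considered at every cell, and ω returns 0 when both events sit at the start of their traces.

```python
from typing import List, Tuple

Trace = List[Tuple[str, int]]  # (activity type, time-stamp in days)


def omega(sigma: Trace, i: int, rho: Trace, j: int) -> float:
    """Temporal penalty of Equation (1) for matching activities (0-based)."""
    d_sigma = sigma[i][1] - sigma[0][1]
    d_rho = rho[j][1] - rho[0][1]
    longest = max(d_sigma, d_rho)
    return 0.0 if longest == 0 else abs((d_sigma - d_rho) / longest)


def similarity(sigma: Trace, rho: Trace) -> float:
    """Time-aware edit-distance similarity in the spirit of Algorithm 1."""
    if not sigma or not rho:
        return 0.0
    n, m = len(sigma), len(rho)
    # matrix[i][j]: minimal penalty to turn sigma[:i] into rho[:j]
    matrix = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = float(i)
    for j in range(1, m + 1):
        matrix[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delete = matrix[i - 1][j] + 1
            insert = matrix[i][j - 1] + 1
            if sigma[i - 1][0] == rho[j - 1][0]:
                # same activity: pay only the temporal penalty
                diag = matrix[i - 1][j - 1] + omega(sigma, i - 1, rho, j - 1)
            else:
                diag = matrix[i - 1][j - 1] + 1  # substitution
            matrix[i][j] = min(delete, insert, diag)
    return 1.0 - matrix[n][m] / max(n, m)


on_time = [("a", 1), ("g", 4)]   # surgery 3 days after admission
delayed = [("a", 1), ("g", 10)]  # same activities, surgery 9 days after
print(similarity(on_time, on_time))  # 1.0
print(similarity(on_time, delayed))  # about 0.667
```

A plain edit distance would score `on_time` and `delayed` as identical, since they contain the same activities in the same order; the temporal penalty is what separates them.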

Algorithm 2.

Density-based k nearest neighbor clustering.

1: Procedure::Density_based_KNN_Clustering(ℒ, k)
2: Input:
3:   ℒ is a careflow log
4:   k is the number of neighbor traces
5: Output:
6:   Φ is the set of patient trace clusters
7: Steps:
8:   For each trace σ in ℒ
9:     set ϕσ= {σ}, Φ ⇐ Φ + ϕσ
10:     set Density(σ) = Density(σ, k)
11:   End for
12:   Let ℒ′ = ℒ
13:   For each trace σ in ℒ′
14:     let ρ1, ρ2, … ρn be the nearest neighbors of σ, where n = ng(σ) is defined with Equation (4).
15:     For each trace ρ ∈ {ρ1, ρ2, … ρn},
16:       If Density(σ, k) < Density(ρ, k) and there exists no ϱ having sim(σ, ϱ) > sim(σ, ρ) and Density(σ, k) < Density(ϱ, k)
17:         let ϕ′σ= ϕσ ∪ ϕρ, where ϕσ containing σ and ϕρ containing ρ
18:         set the density of ϕ′σ to max{Density(σ), Density(ρ)}
19:         Φ ⇐ Φ + ϕ′σ – ϕσ – ϕρ
20:         ℒ′ ⇐ ℒ′ − ϕρ
21:       End If
22:     End For
23:   End For
24:   Let ℒ″ = ℒ
25:   For each trace σ in ℒ″
26:     If σ has no nearest neighbor with density greater than that of σ, but has some nearest neighbor ρ, with density equal to that of σ
27:       let ϕ′σ = ϕσ ∪ ϕρ, where ϕσ containing σ and ϕρ containing ρ
28:       Φ ⇐ Φ + ϕ′σ − ϕσ − ϕρ
29:       ℒ″ ⇐ ℒ″ − ϕρ
30:     End If
31:   End For
32:   Output Φ
33: End Procedure
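The linking logic of Algorithm 2 can be approximated with a union-find structure. The sketch below is a simplified reading, not the authors' implementation: it takes a precomputed symmetric similarity matrix, links each trace to its most similar neighbor of strictly higher density (Steps 12–23), and otherwise links it to neighbors of equal density (Steps 24–31). The bookkeeping on ℒ′ and ℒ″ is replaced by the union-find merges, and the function name is illustrative.

```python
from typing import Dict, List


def knn_density_clusters(sim: List[List[float]], k: int) -> List[List[int]]:
    """Cluster trace indices from a symmetric similarity matrix."""
    n = len(sim)
    neighbors: List[List[int]] = []
    dens: List[float] = []
    for a in range(n):
        others = [b for b in range(n) if b != a]
        top = sorted((sim[a][b] for b in others), reverse=True)[:k]
        if not top:
            neighbors.append([])
            dens.append(0.0)
            continue
        sim_star = min(top)
        nbrs = [b for b in others if sim[a][b] >= sim_star]
        gap = 1.0 - sim_star
        neighbors.append(nbrs)
        dens.append(float("inf") if gap <= 0 else len(nbrs) / gap)  # Eq. (4)

    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x: int, y: int) -> None:
        parent[find(x)] = find(y)

    for a in range(n):
        higher = [b for b in neighbors[a] if dens[b] > dens[a]]
        if higher:
            # link to the most similar neighbor of higher density
            union(a, max(higher, key=lambda b: sim[a][b]))
        else:
            # local density maximum: merge with equally dense neighbors
            for b in neighbors[a]:
                if dens[b] == dens[a]:
                    union(a, b)

    clusters: Dict[int, List[int]] = {}
    for a in range(n):
        clusters.setdefault(find(a), []).append(a)
    return sorted(clusters.values())


# Two tight groups of three traces each (hypothetical similarity values).
hi, lo = 0.75, 0.25
sim_matrix = [[1.0 if a == b else (hi if a // 3 == b // 3 else lo)
               for b in range(6)] for a in range(6)]
print(knn_density_clusters(sim_matrix, k=2))  # [[0, 1, 2], [3, 4, 5]]
```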

Apparently, different values of k will result in different clusters. However, it does not imply that the natural boundaries in the data change with k[13]. As shown in the experiments, a value of k in the range from 3 to 10 works well.

Note that each cluster of a clinical process represents a particular medical behavior category of that process. It is argued that while facing a new piece of information, humans first classify it into an existing information category[16], and then compare it to the previous members of the category to understand how it varies in relation to the general characteristics of the membership category. Once the “normality” has been roughly captured by the discovered clusters from a particular careflow log, one can look for those individual patient traces whose behavior deviates from the normal one.

Anomalous medical behavior detection

With regard to the set of patient trace clusters discovered in the previous step, it is possible to find if a particular patient trace σ is normal or anomalous. To this end, we assume that each discovered patient trace cluster ϕ represents a particular clinical process category, which is supported by a subset of patient traces in a careflow log ℒ (ϕ ⊆ ℒ). Traces of ϕ share a set of common properties that make them perceptually similar to each other, while also making them different from the traces of other clusters. If a particular patient trace σ has similar features with the traces of ϕ, we can say σ is regular with regard to ϕ, otherwise, σ is an anomaly.

To this end, similarities between σ and the traces of ϕ are combined to generate a conclusion of σ. Based on the similarity measure between pairwise traces, we compute the similarity between a particular patient trace σ and the previous members of each trace cluster by defining a function Δϕ(σ) as:

$$\Delta_\phi(\sigma) = \sum_{\rho \in \phi} \omega_\phi(\rho)\, sim(\sigma,\rho) \tag{5}$$

where ωϕ(ρ) is the weight of each member ρ in the cluster ϕ, that indicates the participation of ρ in ϕ.

$$\omega_\phi(\rho) = \frac{1}{|\phi|} \sum_{\rho' \in \phi} sim(\rho,\rho') \tag{6}$$

Δϕ(σ) represents the average weighted similarity between a particular patient trace σ and the members of a cluster ϕ. The selected membership cluster ϕ* is found as:

$$\phi^* = \arg\max_{\phi} \Delta_\phi(\sigma) \tag{7}$$

Once the membership decision for a new trace has been made, we focus on deciding whether the new trace is normal or not. Intuitively speaking, we want to decide the normality of a new trace based on its closeness to the previous members of its membership cluster. This is done with respect to the average closeness between the previous members of its membership cluster. In particular, we define a trace σ as normal with respect to its membership cluster ϕ* if Δϕ*(σ) reaches a particular threshold θ, i.e., if Δϕ*(σ) ≥ θ, σ is normal with regard to ϕ*. Otherwise, it is an anomaly.
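The detection step of Equations (5) to (7) and the threshold test can be sketched as below. This is a minimal illustration under stated assumptions: similarities are precomputed, the weight of Equation (6) is read as the mean similarity of a member to all members of its cluster (itself included), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def membership_score(sim_to_members: List[float],
                     intra_sim: List[List[float]]) -> float:
    """Delta_phi(sigma) of Equation (5) for one cluster.

    sim_to_members[p] is sim(sigma, rho_p); intra_sim[p][q] is
    sim(rho_p, rho_q) for members rho_p, rho_q of the cluster.
    """
    size = len(sim_to_members)
    score = 0.0
    for p in range(size):
        weight = sum(intra_sim[p]) / size      # Equation (6)
        score += weight * sim_to_members[p]    # Equation (5)
    return score


def detect(scores: Dict[str, float], theta: float = 0.3) -> Tuple[str, bool]:
    """Pick the membership cluster (Equation 7) and test normality."""
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best] >= theta


# Hypothetical cluster of two members that are 0.5-similar to each other
# and each 0.5-similar to the new trace.
score = membership_score([0.5, 0.5], [[1.0, 0.5], [0.5, 1.0]])
print(score)                                 # 0.75
print(detect({"A": score, "B": 0.1}))        # ('A', True)
```

The default θ = 0.3 mirrors the threshold used later in the experiments.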

Experiment

To test the feasibility of the proposed approach, we performed experiments on data sets collected from Zhejiang Huzhou Central hospital of China. The experimental setup and the obtained results are presented in the following.

Experimental design

The experimental data sets were extracted from Zhejiang Huzhou Central hospital of China. An Electronic Medical Records (EMR) system has been gradually adopted in the hospital since 2004. The system regularly records all kinds of information about clinical processes in the hospital. In the experiments, we extracted three specific careflow logs, for bronchial lung cancer, colon cancer, and tuberculosis, from the system. The collected data span 2007/08 to 2009/09. In addition, we removed unclosed or incomplete patient traces (e.g., traces in which the patient died or was transferred during his/her LOS) from the collected logs. The details of the retained logs are shown in Table 3, including the number of patient traces, clinical events, and clinical activities, and the average, minimum, and maximum LOS of each log. For example, the tuberculosis careflow log consists of 289 patient traces. The average LOS of these traces is 13.6 days, while some traces take a very short time, e.g., only one day in hospital, and other traces take much longer, e.g., more than 50 days in hospital, which implicitly indicates the diversity of medical behaviors in tuberculosis clinical processes.

Table 3:

The collected careflow logs used in the experiments.

Disease | # of traces | # of events | # of activities | Average LOS (days) | Min LOS (days) | Max LOS (days) | Standard deviation of LOS (days) | Median LOS (days)
Bronchial lung cancer | 48 | 3405 | 225 | 27.2 | 14 | 40 | 6.966 | 29
Tuberculosis | 284 | 13675 | 571 | 13.6 | 1 | 52 | 9.175 | 12.5
Colon cancer | 52 | 4840 | 292 | 23.1 | 12 | 39 | 5.971 | 23.5

The overview of the experimental flowchart involves three steps:

  1. By applying the proposed approach, we evaluate the normality of each patient trace in the collected careflow logs. In particular, we set up 10-fold cross-validation experiments, in which the traces in a particular careflow log are split into ten partitions. Nine partitions are used as training data, and one partition as test data. The proposed anomaly detection model is built on the training data. Then, for the test partition, the normality of each trace is calculated based on the learned model. More formally, we let rσ be the calculation result of a particular patient trace σ:
    $$r_\sigma = \begin{cases} 1 & \text{if } \Delta_{\phi^*}(\sigma) \ge \theta \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
    where Δϕ*(σ) is the normality of σ with regard to the selected membership cluster ϕ*. Note that ϕ* is the cluster that yields the maximum Δϕ(σ) for σ, and θ is a particular threshold value.
  2. As for the benchmark (or ground truth) evaluation data, we asked experienced physicians of Zhejiang Huzhou Central hospital to evaluate the patient traces in the logs. As a result, 4 anomalies from the bronchial lung cancer careflow log, 62 anomalies from the tuberculosis careflow log, and 8 anomalies from the colon cancer careflow log were recognized by the experienced physicians. Formally, we let bσ be the clinical experts’ evaluation result of a particular patient trace σ. If clinical experts take σ as a normal trace, bσ = 1; otherwise, bσ = 0.

  3. The last step is the comparison between the calculation results and the benchmark. The metric “Recall”, denoted by R, is obtained as:
    $$R = \frac{\sum_{\sigma \in \mathcal{L}} r_\sigma \odot b_\sigma}{|\mathcal{L}|} \times 100\% \tag{9}$$
    where rσ and bσ denote the calculation result and the benchmark of the normality of σ, respectively. rσ ⊙ bσ indicates the coincidence between the benchmark and the calculation result on σ. In particular, we let rσ ⊙ bσ = 1 if rσ = bσ; otherwise, rσ ⊙ bσ = 0.
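Equations (8) and (9) amount to thresholding each score and then counting how often the result agrees with the physicians' label. A minimal sketch follows, with hypothetical function names and made-up labels; it is not tied to the paper's data.

```python
from typing import List


def predict(scores: List[float], theta: float) -> List[int]:
    """Equation (8): 1 (normal) if the score reaches theta, else 0."""
    return [1 if s >= theta else 0 for s in scores]


def recall(predicted: List[int], benchmark: List[int]) -> float:
    """Equation (9): percentage of traces where r and b coincide."""
    if len(predicted) != len(benchmark):
        raise ValueError("predicted and benchmark must have the same length")
    agree = sum(1 for r, b in zip(predicted, benchmark) if r == b)
    return 100.0 * agree / len(benchmark)


# Hypothetical example: 48 traces, with 45 of the labels in agreement.
benchmark = [1] * 44 + [0] * 4
predicted = [1] * 44 + [0, 1, 1, 1]
print(recall(predicted, benchmark))  # 93.75
```

Because agreement is counted on both normal and anomalous traces, R as defined here measures overall agreement with the benchmark.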

In comparison, we adopted two different similarity measure approaches in the experiments: (1) the proposed similarity measure method, and (2) the traditional edit distance based similarity measure[15].

Impact of parameter k

We study the impact of the parameter k on the experimental results, where k is the nearest neighbor parameter in the clustering step. We fix the other settings (i.e., θ = 0.3) and vary the value of k from 3 to 10. Figure 1 shows the impact of k on ‘recall’ on the collected careflow logs. There are two curves in Figure 1: one for the proposed similarity measure (denoted by ‘proposed’), the other for the traditional edit distance method, which does not consider the time-stamp information of clinical events (denoted by ‘edit distance’). We observe that ‘recall’ remains basically stable on both curves regardless of the value of k. For example, as shown in Figure 1(a) for the bronchial lung cancer careflow log, recall is fairly stable for k = 3..10 at around 0.9375 with the proposed similarity measure. Figure 1(b) and (c) also demonstrate that a wide range of k gives comparable results on the tuberculosis careflow log and the colon cancer careflow log. In short, the experimental results are robust to k. This is a typical property of density-based clustering algorithms[13].

Figure 1:

Impact of parameter k on the collected log.

In addition, let us compare the proposed similarity measure with the traditional edit distance method. As shown in Figure 1, the proposed method achieves better recall than the traditional edit distance method regardless of the value of k on all three careflow logs. In particular, as we will mention below, there are many unexceptional medical behaviors in the collected logs. It is quite remarkable that using the time-stamp information of clinical events measures the similarity of pairwise traces more accurately than the traditional edit distance method.

Impact of parameter θ

We study the impact of θ on the collected careflow logs, where θ is the threshold value for the normality of patient traces. We fix the other settings (i.e., k = 6) and vary the value of θ from 0.1 to 0.5. In particular, we compare the proposed patient trace similarity measure with the traditional edit distance method. Selected results are given in Figure 2, which has two curves: one for the proposed similarity measure (denoted by ‘proposed’), the other for the traditional edit distance method, which does not consider the time-stamp information of clinical events (denoted by ‘edit distance’). A general trend of recall is observed in the collected logs. For example, as shown in Figure 2(a), recall on the curve ‘proposed’ remains stable with the initial increase of θ on the bronchial lung cancer careflow log, and once θ goes beyond 0.3, recall on the curve ‘proposed’ quickly decreases with the further increase of θ. Clearly, when θ = 0.3, the proposed approach can detect most of the anomalies in the benchmark data. Thus, as a conservative estimate, the default value of the anomaly detection threshold θ is set at 0.3.

Figure 2:

Impact of parameter θ on the collected logs.

Furthermore, we compare the proposed method with the traditional method. As observed in Figure 2, the proposed method outperforms the traditional edit distance method on the collected logs, which indicates that using the time-stamp information of clinical events is more effective than relying on the traditional edit distance alone.

An anomaly example

Due to the length limitation of the paper, we introduce one simple example in this section. This anomaly was detected from the colon cancer careflow log. The clinical event sequence of the patient trace is listed as follows:

  • 〈(‘Admission’, 1), (‘Bone marrow puncture’, 1), (‘Conventional ECG’, 1), (‘Tumor test’, 1), (‘Configuration of anti-tumor chemotherapy’, 2), (‘Blood test’, 4), (‘Colonoscopy’, 5), (‘Configuration of anti-tumor chemotherapy’, 5), (‘Venous catheterization’, 6), (‘Coagulation + D-dimer’, 10), (‘Electrolyte’, 10), (‘Glucose’, 10), (‘Liver and kidney function’, 10), (‘Color-ultrasound examination’, 11), (‘Determination of left ventricular function’, 11), (‘Ventilatory function tests’, 11), (‘Catheterization start’, 12), (‘Central venous pressure measurement’, 12), (‘Deep venous catheterization’, 12), (‘Postoperative drainage start’, 12), (‘B-D heparin cap / second’, 13), (‘Blood gas analysis’, 13), (‘Replacement of fistula’, 13), (‘Ultrasonic atomizing inhalation start’, 13), (‘Blood culture’, 14), (‘Blood test’, 14), (‘Electrolyte’, 14), (‘General physical cooling’, 14), (‘Radical resection of colon cancer surgery’, 14), (‘Sugar, liver and kidney’, 14), (‘B-D heparin cap / second’, 15), (‘Central venous pressure measurement’, 16), (‘Replacement drainage bag’, 17), (‘Catheterization complete’, 18), (‘Ultrasonic atomizing inhalation’, 18), (‘Postoperative drainage complete’, 20), (‘Replacement of fistula’, 20), (‘Replacement of fistula’, 24), (‘Blood test’, 25), (‘Electrolyte’, 25), (‘Glucose’, 25), (‘Liver and kidney function’, 25), (‘Replacement of fistula’, 26), (‘Replacement of fistula’, 30), (‘Replacement drainage bag’, 31), (‘Replacement of fistula’, 33), (‘Electrolyte’, 35), (‘Glucose’, 35), (‘Replacement of fistula’, 35), (‘Liver and kidney function’, 35), (‘Discharge’, 39)〉.

Note that most of the clinical activities in this patient trace are common components of the colon cancer treatment process. Consequently, when the edit distance based approach is used to measure pairwise trace similarity, this patient trace is falsely recognized as a normal trace. However, the clinical activities in the trace occurred markedly later than in the normal traces of the careflow log. For example, in the colon cancer careflow log, more than 70% of patients (i.e., 31 out of 52 traces in the log) had their radical resection of colon cancer surgery performed between 3 and 7 days after ‘Admission’, while the patient of this anomaly underwent surgery on the 14th day after ‘Admission’. The significant postponement of the surgery is a typical temporal feature of the anomaly. In addition, many activities that are commonly performed on the first or second day after ‘Admission’ were conducted much later in the trace (e.g., the activity ‘Blood test’ was performed on the fourth day after ‘Admission’). Furthermore, the LOS of this anomaly is 39 days, which is much longer than the average LOS of the colon cancer patient traces in the log (i.e., 13.1 days). This simple example indicates that the proposed approach can feasibly detect anomalies in clinical processes.
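
The temporal features discussed above can be read off the trace with a few lines of code. The sketch below uses an abbreviated copy of the trace and the figures quoted in the text (a typical surgery window of 3 to 7 days and an average LOS of 13.1 days). The function names and the choice to treat the day stamp of ‘Discharge’ as the LOS are assumptions made for illustration, and the sketch is not the detection model itself.

```python
# Abbreviated copy of the anomalous trace: (activity, day) pairs from the text.
trace = [
    ("Admission", 1),
    ("Blood test", 4),
    ("Radical resection of colon cancer surgery", 14),
    ("Discharge", 39),
]

SURGERY = "Radical resection of colon cancer surgery"
TYPICAL_SURGERY_WINDOW = (3, 7)  # days after admission, as quoted in the text
MEAN_LOS_DAYS = 13.1             # average LOS in the log, as quoted in the text


def first_day(trace, activity):
    """Return the day of the first occurrence of an activity, or None."""
    for name, day in trace:
        if name == activity:
            return day
    return None


def temporal_flags(trace):
    """Summarise how far a trace deviates from the typical timing.

    Assumes the trace contains a 'Discharge' event and that its day stamp
    can be read as the length of stay, as the text does.
    """
    surgery_day = first_day(trace, SURGERY)
    los = first_day(trace, "Discharge")
    _, latest_typical = TYPICAL_SURGERY_WINDOW
    return {
        "surgery_day": surgery_day,
        "surgery_delayed": surgery_day is not None and surgery_day > latest_typical,
        "los_days": los,
        "los_ratio": round(los / MEAN_LOS_DAYS, 2),
    }


flags = temporal_flags(trace)
print(flags)
```

For this trace the surgery falls outside the typical window and the LOS is roughly three times the average, which matches the temporal features described above.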

Conclusion and future work

In this study, we propose a computational model for anomaly detection in clinical processes. As the notion of anomaly is closely related to what is meant by normal, we have modeled anomalies as patient careflow traces that deviate from behaviors perceived as normal in clinical processes. By applying a density-based patient trace clustering approach to the logs to learn the notion of normality, it is possible to detect anomalies that deviate from normal medical behaviors in clinical processes.
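
As a rough illustration of the density idea, the sketch below flags a trace as anomalous when too few other traces lie within a given distance of it. This is a simplified density criterion written for illustration only. The names `neighbor_counts` and `flag_anomalies`, the parameters `eps` and `min_pts`, and the toy distance (absolute difference in surgery day) are all assumptions, and the sketch should not be read as the clustering algorithm of this paper.

```python
def neighbor_counts(dist, eps):
    """For each trace, count the other traces within distance eps."""
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] <= eps)
        for i in range(n)
    ]


def flag_anomalies(dist, eps, min_pts):
    """Flag traces whose neighborhood holds fewer than min_pts traces."""
    return [count < min_pts for count in neighbor_counts(dist, eps)]


# Toy data: each trace is reduced to a single hypothetical feature,
# the day on which surgery was performed.
surgery_days = [4, 5, 5, 6, 14]

# Pairwise distance matrix; a real trace distance would go here instead.
dist = [[abs(a - b) for b in surgery_days] for a in surgery_days]

flags = flag_anomalies(dist, eps=2, min_pts=2)
print(flags)
```

Only the last toy trace, whose surgery day is far from all the others, ends up without neighbors and is flagged. Any pairwise trace distance can be substituted for the toy distance without changing the rest of the sketch.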

Clinical evaluation of the proposed approach is in progress. We are holding further discussions with clinical experts on possible improvements and are evaluating the proposed approach in various clinical settings. Moreover, we are working with developers to embed this technique into an EMR system, which is expected to provide practitioners with reminders and alerts, as well as complete and accurate data for physicians, so as to achieve high-quality clinical process performance and prevent medical errors through an adaptive mechanism for clinical processes.

For future research, we aim to answer a practical question: not how to limit anomalies through increased controls, but how to foster ‘good’ anomalies (those that result in improved patient-care performance, for example) and how to limit ‘bad’ anomalies (those that result in decreased patient satisfaction, for example). This could be followed by a more formal analysis of clinical processes in order to document and systematically measure the extent of anomalies and to explore the appropriateness of these anomalies in different clinical situations.

References

  • 1. Dadam P, Reichert M, Kuhn K. Clinical workflows - the killer application for process-oriented information systems? The 4th International Conference on Business Information System; 2000.
  • 2. Peleg M, Gutnik LA, Snow V, Patel VL. Interpreting procedures from descriptive guidelines. Journal of Biomedical Informatics. 2006;39(2):184–195. doi: 10.1016/j.jbi.2005.06.002.
  • 3. Hunter B, Segrott J. Re-mapping client journeys and professional identities: A review of the literature on clinical pathways. International Journal of Nursing Studies. 2008;45:608–625. doi: 10.1016/j.ijnurstu.2007.04.001.
  • 4. Rebuge A, Ferreira DR. Business process analysis in healthcare environments: A methodology based on process mining. Information Systems. 2012;37(2):99–116.
  • 5. Huang Z, Lu X, Duan H. Using recommendation to support adaptive clinical pathways. Journal of Medical Systems. 2011:1–12. doi: 10.1007/s10916-010-9644-3.
  • 6. Lu X, Huang Z, Duan H. Supporting adaptive clinical treatment processes through recommendations. Computer Methods and Programs in Biomedicine. 2011. doi: 10.1016/j.cmpb.2010.12.005.
  • 7. James BC, Savitz LA. How intermountain trimmed health care costs through robust quality improvement efforts. Health Affairs. 2011;30(6):1185–1191. doi: 10.1377/hlthaff.2011.0358.
  • 8. Huang Z, Lu X, Duan H. On mining clinical pathway patterns from medical behaviors. Artificial Intelligence in Medicine. 2012. doi: 10.1016/j.artmed.2012.06.002.
  • 9. Huang Z, Lu X, Gan C, Duan H. Variation prediction in clinical processes. In: Peleg M, Lavrac N, Combi C, editors. Artificial Intelligence in Medicine, volume 6747 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2011. pp. 286–295.
  • 10. Okita A, Yamashita M, Abe K, Nagai C, Matsumoto A, Akehi M, et al. Variance analysis of a clinical pathway of video-assisted single lobectomy for lung cancer. Surgery Today. 2009;39. doi: 10.1007/s00595-008-3821-8.
  • 11. van de Klundert J, Gorissen P, Zeemering S. Measuring clinical pathway adherence. Journal of Biomedical Informatics. 2010;43(6):861–872. doi: 10.1016/j.jbi.2010.08.002.
  • 12. Westbrook JI, Coiera EW, Gosling AS, Braithwaite J. Critical incidents and journey mapping as techniques to evaluate the impact of online evidence retrieval systems on health care delivery and patient outcomes. International Journal of Medical Informatics. 2007;76(2–3):234–245. doi: 10.1016/j.ijmedinf.2006.03.006.
  • 13. Kum HC, Chang J, Wang W. Sequential pattern mining in multi-databases via multiple alignment. Data Mining and Knowledge Discovery. 2006;12:151–180.
  • 14. Agrawal R, Srikant R. Mining sequential patterns. Proceedings of the Eleventh International Conference on Data Engineering (ICDE ’95); Washington, DC, USA. IEEE Computer Society; 1995. pp. 3–14.
  • 15. Gusfield D. Algorithms on strings, trees and sequences: Computer science and computational biology. Cambridge University Press; 1997.
  • 16. Rosch E, Mervis C, Gray W, Johnson D, Boyes-Braem P. Basic objects in natural categories. Cognitive Psychology. 1976;8:382–439.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
