Abstract
Clinical pathways are used to guide clinicians to provide a standardised delivery of care. Because of their standardisation, the aim of clinical pathways is to reduce variation in both care process and patient outcomes. When learning clinical pathways from data through data mining, it is common practice to represent each patient pathway as a string corresponding to their movements through activities. Clustering techniques are popular methods for pathway mining, and therefore this paper focuses on distance metrics applied to string data for k-medoids clustering. The two main aims are, firstly, to develop a technique that seamlessly integrates expert information with data and, secondly, to develop a string distance metric for the purpose of process data. The overall goal is to allow more meaningful clustering results to be found by adding context into the string similarity calculation. Eight common distance metrics and their applicability are discussed. These distance metrics give an arbitrary distance, without consideration for context, and each produces different results. As a result, this paper describes the development of a new distance metric, the modified Needleman–Wunsch algorithm, that allows for expert interaction with the calculation by assigning groupings and rankings to activities, which provide context to the strings. This algorithm has been developed in partnership with the UK’s National Health Service (NHS) with a focus on a lung cancer pathway; however, the handling of the data and algorithm allows for application to any disease type. This method is contained within Sim.Pro.Flow, a publicly available decision support tool.
Keywords: Clinical pathways, Data mining, Lung cancer
Highlights
• Complex clinical pathways may require variation reduction to aid understanding.
• Patient pathway strings enable k-medoids clustering using string distance metrics.
• Eight common distance metrics discussed ultimately lack consideration for context.
• Produces a modified Needleman–Wunsch algorithm combining data and expert knowledge.
• Case study applies the method to lung cancer.
1. Introduction
Lung cancer is in the top ten causes of death, the most common cause of cancer death in men, and second most common in women, worldwide [1]. Cancer mortality can be reduced with early detection and treatment. As a consequence, the goal of many organisations that provide cancer services is to reduce the time taken to diagnose and treat cancer.
In the age of digital health, the organisation of health information into interactive clusters and other novel methods for stratifying health data will complement existing approaches and potentially lead to improvements in health care [2]. As health information technology (IT), such as electronic health records (EHRs), gains widespread adoption and use in the healthcare industry, thereby accumulating vast amounts of real-time patient care data, there is tremendous opportunity to develop data-driven models, methods and tools to facilitate review of practice workflows and improve evidence-based care delivery by learning practice-based pathways of care [3], [4], henceforth denoted as clinical pathways.
When considering clinical pathway modelling, a primary question is often to consider what is the pathway. A recent review of the current literature [5] highlighted that there are many data mining and machine learning methods available for answering such questions. However, it was clear that most of these techniques only consider the pathways discoverable from data, and do not consider the wealth of information available from the experts that interact with the pathway day to day. The benefit of consulting with experts is that they may be able to explain some obscure or outlier information that can be picked up within the data. It is speculated that the lack of interaction between using both data and expert knowledge is due to the time consuming nature of such a process.
Clustering techniques were highlighted in the literature [5] as the most popular method for pathway discovery. Accordingly, this paper focuses on distance measures applied to string data for the purpose of k-medoids clustering [6]. This method was chosen because, firstly, the data used is similar to that of Vogt et al. [7] and, secondly, using an existing pathway as the cluster centre (the medoid) reinforces the medical experts’ confidence that the chosen pathway is realistic. Unlike some other methods, clustering does not require that each activity be performed at most once, which makes it more versatile and applicable.
This paper discusses the development of a new distance metric, modified from the Needleman–Wunsch algorithm, to allow for consideration of both data and medical expert information, for the use with clustering. Eight other popular distance metrics are discussed and used as reference for benchmarking the performance of the modified metric. The main dataset contains 2350 non-small cell lung cancer referrals provided by Velindre Cancer Centre (VCC), a cancer centre in the UK’s National Health Service (NHS).
The content is structured as follows: Section 2 contains a discussion of previous research, Section 3 gives a description of the problem, Section 4 discusses some current metrics and their properties, Section 5 details the development of the new algorithm, Section 6 applies the method to case studies. The paper closes with a conclusion and recommendations for further work.
2. Previous research
Aspland, Gartner and Harper [5] conducted an in-depth literature review on clinical pathway modelling which provides a taxonomy of problems related to clinical pathways and explores the intersection between methods drawn from Information Systems, Operational Research and Industrial Engineering. There were 82 papers in the review [5] which stated using data mining or machine learning, for mapping, modelling or improving the clinical pathway. Table 1 further categorises these papers into specific method areas.
Table 1.
Publications categorised as data mining or machine learning method.
| Method | References |
|---|---|
| Clustering | [3], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18] |
| Categorised | [19], [20], [21], [22], [23] |
| Classified | [24], [25] |
| Topic modelling | [26], [27] |
| Probabilistic | [20], [28], [29] |
| Latent Dirichlet allocation | [25], [30], [31], [32], [33], [34], [35], [36] |
| Pattern mining | [14], [21], [37] |
| Sequential pattern mining | [17], [38], [39], [40], [41], [42], [43], [44], [45] |
| Temporal pattern mining | [39], [46], [47] |
| Process mining | [13], [14], [15], [35], [48], [49], [50], [51], [52], [53], [54], [55], [56] |
| Bayesian | [22], [57], [58], [59], [60] |
| Markov | [3], [61], [62], [63], [64] |
| Heuristics | [10], [65], [66], [67], [68] |
| Semantic web rule language | [69], [70] |
| Artefact | [60], [71], [72], [73], [74] |
| Business Process Model and Notation (BPMN) | [75], [76], [77] |
| Other | [23], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87] |
It can be seen that clustering was the most popular method. On closer inspection there are multiple methods of clustering used; for example, Funkner et al. [10] use K-means, Vogt et al. [7] use K-medoids and Zhang et al. [3] use hierarchical clustering. Furthermore, the differences go deeper when considering the distance measures used during clustering, as Funkner et al. [10] use Levenshtein distance, Syed and Dias [43] modify the Needleman–Wunsch algorithm, whereas Vogt et al. [7] and Zhang et al. [3] use Longest Common Subsequence (LCS).
Aspland, Gartner and Harper [5] also highlighted that there are two common ways of obtaining the pathway: either data-driven or through collaboration with experts who regularly interact with the pathway. Data-driven pathway discovery was most popular, containing 90 papers, compared to 13 papers that considered collaboration only.
Aspland, Gartner and Harper [5] state that there are 14 papers that considered information from both of these sources [52], [53], [71], [88], [89], [90], [91], [92], [93], [94], [95], [96], [97], [98]. All of these papers consider data alongside expert opinion, interviews or literature, and do so in a way that they enhance or fill in for missing information.
None of the papers integrate the two sets of information in a simple and direct manner. Furthermore, considering just one of these methods leaves a wealth of knowledge that is not considered.
3. Problem description
The pathway for cancer diagnosis starts at referral and ends at start of treatment, and contains many steps in between which detect the stage of the cancer. Within the UK there are different guidelines on how to conduct the cancer pathway, which are summarised in Table 2. In Wales, the National Optimal Lung Cancer Pathway (NOLCP) [99] is currently in place, and is in the process of being replaced by the Single Cancer Pathway [100]. For ease of understanding, we have converted the NOLCP to a simplified version containing just the activities and maximum time frames for completion (Appendix Fig. A.15).
Table 2.
UK and Ireland cancer pathway guidelines.
| Country | Guideline | Provider |
|---|---|---|
| England | National Optimal Lung Cancer Pathway (NOLCP) | Cancer Research UK [99] |
| Wales | Single Cancer Pathway, National Optimal Pathway for Lung Cancer | Wales Cancer Network [100], NHS Wales [101] |
| Scotland | Management of lung cancer | Healthcare Improvement Scotland [102] |
| Northern Ireland | Lung Pathway | Northern Ireland Cancer Network [103] |
| Ireland | Lung Cancer Action Plan | Irish Cancer Society [104] |
Fig. A.15.
Simplified National Optimal Lung Cancer Pathway.
Fig. A.15 (Appendix) shows that a patient is rarely allowed to attend the same activity more than once. In fact, each activity was only recorded once in the dataset, which imposes a hard restriction of not allowing multiple attendances of an activity.
To adhere to this constraint all of the past performed activities would need to be considered when choosing the next activity to avoid duplication. As the memory-less property of Markov chains only allows the directly preceding activity to be considered, using Markov chains was not appropriate for our data. Therefore clustering was chosen as an appropriate method.
The data set used contains date stamps for each patient and each activity that was performed. To first extract the pathways from the data set, each activity is assigned a letter code, and then the activities are ordered by the date that they occurred, and joined together to form a string of letters.
For example, if a patient was first seen on 01/01/2019, then received a diagnosis on 02/01/2019 and then their case was discussed at a Multi-Disciplinary Team Meeting (MDT) on 03/01/2019, and these activities were assigned the letter codes A, B and C respectively, then the pathway for this patient would be ABC.
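The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration only: the activity names, the `ACTIVITY_CODES` mapping and the `pathway_string` helper are hypothetical and are not taken from the dataset or from Sim.Pro.Flow.

```python
from datetime import date

# Hypothetical letter codes for the three activities in the example above.
ACTIVITY_CODES = {"first seen": "A", "diagnosis": "B", "MDT": "C"}


def pathway_string(events):
    """Order (activity, date) pairs by date and join their letter codes."""
    ordered = sorted(events, key=lambda event: event[1])
    return "".join(ACTIVITY_CODES[activity] for activity, _ in ordered)


# Date stamps may arrive in any order; sorting by date recovers the pathway.
events = [
    ("MDT", date(2019, 1, 3)),
    ("first seen", date(2019, 1, 1)),
    ("diagnosis", date(2019, 1, 2)),
]
print(pathway_string(events))  # ABC
```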
To aid with visualisation of this, Fig. 1 shows a heatmap displaying the pathways, where the data has been ordered alphabetically. Along the x-axis is the position of the activity, and the y-axis is the number of patients, where each integer represents one patient. Furthermore, each activity code has been assigned a colour, and thus the heatmap represents the patient pathways as a line of various colours.
Fig. 1.
All patient pathways displayed as a heatmap.
Fig. 1 shows that there is a large amount of variation in the position, number and sequence of the activities performed. This indicates that condensing this large variation into a simple clinical pathway to be used as guideline is a difficult task.
4. Description of metrics
There are many different possible metrics that can be used to compare two strings, given that the Python library textdistance [105] (a library to compare distance between two or more sequences) hosts over 30 algorithms for this purpose. Eight different metrics were considered to use as comparison and benchmarking for the modified algorithm, which cover edit distances, token based and sequence based distances. These eight metrics were chosen as they most appropriately fit the purpose, reflect the literature and show a variety of techniques.
4.1. Edit distances
Here there are five edit distances considered: Levenshtein, Damerau–Levenshtein, Jaro, Jaro–Winkler and Needleman–Wunsch.
Levenshtein
The Levenshtein distance was developed in 1965 for the use of correcting deletions, insertions and reversals of binary codes [106]. The general idea is to evaluate the distance between two strings as the number of single-character edits required to change one string into the other. There are many current uses for the Levenshtein distance, e.g. spell checkers, optical character recognition correction systems and linguistic distance, to name a few.
The Levenshtein distance can easily be calculated by hand, by giving a penalty of one to each insertion, deletion or substitution, as demonstrated in Fig. 2.
Fig. 2.

Example of the calculation for the Levenshtein distance.
The Levenshtein distance can be translated into a dynamic programming algorithm displayed in Algorithm 1. The dynamic programming matrix X for the example from Fig. 2 can be seen in Fig. 3.
Fig. 3.

Example of dynamic programming using the Levenshtein distance.
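A minimal Python sketch of the standard Levenshtein dynamic programme is given here for illustration. It is written from the textbook recurrence rather than copied from Algorithm 1, and the function name and test strings are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    rows, cols = len(a) + 1, len(b) + 1
    X = [[0] * cols for _ in range(rows)]
    # First column and row: cost of deleting or inserting every character.
    for i in range(rows):
        X[i][0] = i
    for j in range(cols):
        X[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            X[i][j] = min(
                X[i - 1][j] + 1,                 # deletion
                X[i][j - 1] + 1,                 # insertion
                X[i - 1][j - 1] + substitution,  # match or substitution
            )
    return X[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The bottom-right cell of the matrix X holds the distance, which is the same cell read off in a hand-built matrix.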
Damerau–Levenshtein
The Damerau–Levenshtein is an extension of the Levenshtein distance, where transpositions (swapping positions of adjacent letters) are also allowed [107].
An example of the hand calculation for the Damerau–Levenshtein distance can be seen in Fig. 4. Again, this can also be performed using dynamic programming.
Fig. 4.

Example of the calculation for the Damerau–Levenshtein distance.
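As an illustration, the transposition rule can be added to the Levenshtein recurrence with one extra case. The sketch below implements the restricted variant (often called optimal string alignment), in which a transposed pair is not edited again; this choice of variant and the function name are assumptions, not details taken from the paper.

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance extended with adjacent transpositions (restricted variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    X = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        X[i][0] = i
    for j in range(cols):
        X[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            X[i][j] = min(
                X[i - 1][j] + 1,         # deletion
                X[i][j - 1] + 1,         # insertion
                X[i - 1][j - 1] + cost,  # match or substitution
            )
            # Extra case: the last two characters are swapped between the strings.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                X[i][j] = min(X[i][j], X[i - 2][j - 2] + 1)
    return X[-1][-1]


print(osa_distance("CO", "OC"))  # 1, where plain Levenshtein would give 2
```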
Jaro
The Jaro similarity was first developed for the purpose of record linkage [108], [109]. The formula considers four variables: the length of both strings (a,b), the number of matching characters (m) within a tolerance (T), and the number of transpositions within those matching characters (t). The formula for Jaro similarity is as follows:
$$\text{sim}_J = \frac{1}{3}\left(\frac{m}{a} + \frac{m}{b} + \frac{m - t}{m}\right) \tag{1}$$
where the tolerance (T) for m is calculated by
$$T = \left\lfloor \frac{\max(a, b)}{2} \right\rfloor - 1$$
and only the integer part is used. For further clarity, two characters are only considered matching if they are within T places of each other.
This will produce a value between 0 and 1, where 1 indicates that the strings are identical, and therefore a larger value is desired.
To calculate the distance instead of similarity, the metric needs to be adjusted by performing $1 - \text{sim}_J$.
For example, in Fig. 5 there are 4 matches within the tolerance of 1 (see below), shown in green, however C and O need to be transposed.
Fig. 5.

Example of the calculation for the Jaro distance. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
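The calculations for the example in Fig. 5 are: a = 5, b = 5, T = ⌊5/2⌋ − 1 = 1, m = 4, t = 1
$$\text{sim}_J = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{5} + \frac{4 - 1}{4}\right) \approx 0.7833 \tag{2}$$
$$1 - \text{sim}_J \approx 1 - 0.7833 = 0.2167 \tag{3}$$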
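The arithmetic of this example can be checked with a short Python sketch. It assumes the usual form of the Jaro similarity, (1/3)(m/a + m/b + (m − t)/m), with t counted as in the text, and takes the counts a, b, m and t as inputs rather than deriving them from the strings; the function names are illustrative.

```python
def jaro_tolerance(a: int, b: int) -> int:
    """Matching tolerance T: integer part of max(a, b) / 2, minus 1."""
    return max(a, b) // 2 - 1


def jaro_similarity(a: int, b: int, m: int, t: int) -> float:
    """Jaro similarity from string lengths a and b, matches m and transpositions t."""
    if m == 0:
        return 0.0  # no matching characters at all
    return (m / a + m / b + (m - t) / m) / 3


# Counts from the worked example: a = 5, b = 5, m = 4, t = 1.
print(jaro_tolerance(5, 5))                          # 1
print(round(1 - jaro_similarity(5, 5, 4, 1), 4))     # 0.2167
```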
Jaro–Winkler
The Jaro–Winkler distance is an extension of the Jaro distance [110] through the following formula:
$$\text{sim}_W = \text{sim}_J + \ell p \left(1 - \text{sim}_J\right) \tag{4}$$
where $\text{sim}_J$ is the Jaro similarity, $\ell$ is the length of the common prefix before the first non-match, up to a maximum of 4, and $p$ is a scaling factor which should not exceed 0.25. Typically $p$ is chosen to be 0.1. Again, to calculate the distance instead of similarity, the metric needs to be adjusted by performing $1 - \text{sim}_W$.
Applying this calculation to the example, as $\ell$ is 0 (because the first position is a non-match), we would get the same result as the Jaro distance (0.2166666667). Therefore, the example is changed slightly as shown in Fig. 6.
Fig. 6.

Example of the calculation for the Jaro–Winkler distance.
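First calculating the Jaro distance, to allow for calculating the Jaro–Winkler distance, uses: a = 4, b = 5, T = ⌊5/2⌋ − 1 = 1, m = 3, t = 0
$$\begin{aligned}
\text{sim}_J &= \frac{1}{3}\left(\frac{3}{4} + \frac{3}{5} + \frac{3 - 0}{3}\right) \approx 0.7833 \\
1 - \text{sim}_J &\approx 0.2167 \\
\text{sim}_W &= 0.7833 + \ell p \left(1 - 0.7833\right) \\
\text{dist}_W &= 1 - \text{sim}_W
\end{aligned} \tag{5–9}$$
where $\ell$ is the length of the common prefix of the two strings in Fig. 6 and $p$ is the chosen scaling factor.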
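The Winkler adjustment itself is a one-line calculation once the Jaro similarity is known. The sketch below assumes the standard form sim_J + l·p·(1 − sim_J), with the prefix length capped at 4 and p defaulting to 0.1; the function name and the input values in the usage lines are illustrative, not taken from Fig. 6.

```python
def jaro_winkler_similarity(sim_j: float, prefix_len: int, p: float = 0.1) -> float:
    """Boost a Jaro similarity according to the length of the common prefix."""
    l = min(prefix_len, 4)  # the prefix bonus is capped at four characters
    return sim_j + l * p * (1 - sim_j)


# With no common prefix the Jaro similarity is returned unchanged.
print(jaro_winkler_similarity(0.5, 0))  # 0.5
# A longer common prefix moves the similarity closer to 1.
print(round(jaro_winkler_similarity(0.5, 2), 2))  # 0.6
```

The corresponding distance is 1 minus the returned similarity.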
Needleman–Wunsch
The Needleman–Wunsch algorithm was first used in bio-informatics to align protein or nucleotide sequences, and makes use of dynamic programming [111]. It may also be referred to as the optimal matching algorithm or the global alignment technique.
This is a generalised variant of the Levenshtein distance, where values for match, swap and gap are chosen by the user. The most common values chosen for these variables are: Match (m) = 1, Swap (s) = −1 and Gap (g) = −1. Again, this can easily be checked by hand, as shown in Fig. 7.
Fig. 7.

Example of the calculation for the Needleman–Wunsch distance.
The Needleman–Wunsch algorithm also makes use of dynamic programming to calculate the distance computationally, the pseudo-code for which can be seen in Algorithm 2.
An example of the matrix produced using the Needleman–Wunsch dynamic programming algorithm can be seen in Fig. 8.
Fig. 8.

Example of the Needleman–Wunsch algorithm.
Once the matrix such as that in Fig. 8 has been produced, we can perform traceback to find the alignment. This means, starting at the bottom-right corner of the matrix, and working back through the matrix to the top-left 0, and noting the direction that the value came from. This is highlighted in Fig. 8 by the black arrows.
A diagonal arrow indicates an alignment, an arrow to the left indicates that the character in the left string is aligned with a gap, and an arrow straight up indicates that the character in the top string is aligned with a gap.
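The matrix fill and traceback can be sketched in Python as follows. This is an illustrative implementation of the standard (maximising) Needleman–Wunsch recurrence with m = 1, s = −1 and g = −1 as assumed defaults; it is not a transcription of Algorithm 2, and the convention of which string runs along the top of the matrix is an assumption of the sketch.

```python
def needleman_wunsch(top: str, left: str, m: int = 1, s: int = -1, g: int = -1):
    """Return (score, aligned_top, aligned_left) for a global alignment."""
    rows, cols = len(left) + 1, len(top) + 1
    X = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        X[i][0] = i * g
    for j in range(1, cols):
        X[0][j] = j * g

    def diagonal(i, j):
        # Score of arriving at (i, j) by aligning left[i-1] with top[j-1].
        return X[i - 1][j - 1] + (m if left[i - 1] == top[j - 1] else s)

    for i in range(1, rows):
        for j in range(1, cols):
            X[i][j] = max(diagonal(i, j), X[i][j - 1] + g, X[i - 1][j] + g)

    # Traceback from the bottom-right corner to the top-left 0.
    i, j = rows - 1, cols - 1
    out_top, out_left = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and X[i][j] == diagonal(i, j):
            out_top.append(top[j - 1]); out_left.append(left[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and X[i][j] == X[i][j - 1] + g:
            out_top.append(top[j - 1]); out_left.append("-")  # horizontal move
            j -= 1
        else:
            out_top.append("-"); out_left.append(left[i - 1])  # vertical move
            i -= 1
    return X[-1][-1], "".join(reversed(out_top)), "".join(reversed(out_left))


print(needleman_wunsch("ABC", "AC"))  # (1, 'ABC', 'A-C')
```

When several moves tie, this sketch prefers the diagonal, then the horizontal move; other tie-breaking rules give equally valid alignments with the same score.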
The textdistance library [105] does not easily allow for alternative values of m, s and g to be used.
4.2. Token based distances
Two token-based distances are discussed here, Jaccard and Cosine. In this context, a token means a partition of the string, and in both of these distances this relates to n-grams. An n-gram is defined as a contiguous sequence of n items.
Jaccard distance
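The Jaccard distance [112] is calculated using the following equation, where A and B are the sets of n-grams of the two strings.
$$d_{\text{Jaccard}}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{10}$$
An example of n-grams, where n = 2 (bi-grams), for the two strings BKAOC and KABCO, as required for the Jaccard distance, can be seen in Fig. 9. Applying the formula to this example yields:
$$d_{\text{Jaccard}} = 1 - \frac{1}{7} \approx 0.857 \tag{11}$$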
Fig. 9.

Example of bi-gram for Jaccard distance.
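A short Python sketch of the bi-gram Jaccard distance follows. It assumes that the n-grams are treated as sets, so repeated n-grams count once, and the helper names are illustrative.

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of contiguous substrings of length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard_distance(a: str, b: str, n: int = 2) -> float:
    """1 minus the ratio of shared n-grams to all distinct n-grams."""
    A, B = ngrams(a, n), ngrams(b, n)
    if not A and not B:
        return 0.0  # two strings too short to have any n-grams
    return 1 - len(A & B) / len(A | B)


print(sorted(ngrams("BKAOC")))  # ['AO', 'BK', 'KA', 'OC']
print(sorted(ngrams("KABCO")))  # ['AB', 'BC', 'CO', 'KA']
print(round(jaccard_distance("BKAOC", "KABCO"), 3))  # 0.857
```

Only the bi-gram KA is shared, out of seven distinct bi-grams in total.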
Cosine distance
The Cosine distance is typically used to compare the number of similar words in a document and also in data mining to measure cohesions in clusters [113].
Firstly, the Cosine similarity of the n-grams of both strings is calculated using the following equation, where A and B are the n-gram count vectors of the two strings:
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \tag{12}$$
Then, as previously, 1 minus the similarity needs to be calculated to obtain the distance. Applying this calculation to the n-grams previously calculated in Fig. 9 results in a cosine distance of 0.75.
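For illustration, the cosine distance on bi-gram counts can be sketched in Python as below. It assumes each string is represented as a vector of n-gram counts; the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(s: str, n: int = 2) -> Counter:
    """Count of each contiguous substring of length n."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_distance(a: str, b: str, n: int = 2) -> float:
    """1 minus the cosine of the angle between the two n-gram count vectors."""
    A, B = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(A[gram] * B[gram] for gram in A)  # missing n-grams count as 0
    norm = sqrt(sum(v * v for v in A.values())) * sqrt(sum(v * v for v in B.values()))
    if norm == 0:
        return 1.0  # at least one string has no n-grams
    return 1 - dot / norm


print(cosine_distance("BKAOC", "KABCO"))  # 0.75
```

Each string has four distinct bi-grams and one is shared, so the similarity is 1 / (2 × 2) = 0.25.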
4.3. Sequence based distances
Longest common subsequence
The longest common subsequence (LCS) refers to the longest subsequence common to both sequences, where the subsequences do not have to occupy consecutive positions, but do have to be in sequence. Fig. 10 displays that the LCS for our example is 3.
Fig. 10.

Example of longest common subsequence.
To consider LCS as a distance, we need to consider what remains when you remove the LCS. In Fig. 10 this would be what remains in white, and therefore would give a LCS distance of 2.
For an arbitrary number of sequences, finding the LCS has been shown to be an NP-hard problem [114]; for two sequences, dynamic programming allows efficient computation. The pseudo-code for the dynamic programming of the LCS can be seen in Algorithm 3.
Fig. 11 illustrates the dynamic programming calculation for the example in Fig. 10.
Fig. 11.
Example of longest common subsequence.
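A compact Python sketch of the two-string LCS dynamic programme is shown below for illustration; it follows the textbook recurrence rather than Algorithm 3. Defining the distance as the length of the longer string minus the LCS length is an assumption that reproduces the value of 2 for the equal-length strings BKAOC and KABCO.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    X = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                X[i][j] = X[i - 1][j - 1] + 1  # extend the common subsequence
            else:
                X[i][j] = max(X[i - 1][j], X[i][j - 1])
    return X[-1][-1]


def lcs_distance(a: str, b: str) -> int:
    """Characters of the longer string left over once the LCS is removed."""
    return max(len(a), len(b)) - lcs_length(a, b)


print(lcs_length("BKAOC", "KABCO"))    # 3
print(lcs_distance("BKAOC", "KABCO"))  # 2
```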
4.4. Properties of metric
When selecting an appropriate distance metric, it is important to consider which properties are important when calculating similarity. There are three key properties that can be considered with string metrics, namely length, sequence and position.
Length: It is apparent that considering strings of differing length is a common occurrence in process data, in particular with medical diagnosis, as it is a process of discovery and one that may need different activities based on the results of a previous one. Therefore, the algorithm needs to consider the differing length of two strings.
Sequence: The sequence in which activities occur is important and must be considered, especially when considering the previous statement that the results of one activity may change the course of the pathway.
Position: The position that the activity, and the sequence of activities, occurs within the pathway is vitally important to consider when developing an algorithm for process data.
All of these properties are considered to varying degrees in each of the eight metrics considered in the previous section. For example, length is evidently considered in the Jaro calculation, as it is a main variable in the formula, whereas in the Levenshtein distance length is indirectly considered via the upper and lower bounds for the possible values (the upper bound is the length of the longer string, and the lower bound is the difference between the lengths of the strings). Furthermore, sequence is evidently considered in LCS, as it is in the name, whereas sequence is considered in an alternative way in the Jaccard distance through the use of n-grams.
This shows that string distance metrics do possess the correct qualities to be applied to process data.
The string distances are currently underperforming when considering small differences between strings. The addition of an extra letter will be considered, but it does not make a difference what letter it is or what it represents. This leads to many string comparisons resulting in the same value (as seen in Appendix Fig. A.17, Fig. A.18). It is theorised that this will lead to poor cluster distinction and adding some uniqueness will improve upon this.
Fig. A.17.
Comparison of the Ten Metrics Applied to Sample 1. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. A.18.
Comparison of the Ten Metrics Applied to Sample 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
In attempting to address this, it became evident that complete uniqueness was difficult to achieve, as it violated some fundamental relationships (such as symmetry). However, adding more uniqueness than is currently displayed in the distance metrics was successful.
Addressing the previous property also made it possible to include some meaning in the strings. As discussed previously, there is no consideration in the metrics for which letter has been added and what that might represent. This is likely due to the origins of the metrics typically being for spell checkers and similar applications, where there is no need to consider this. However, in terms of process data, it can make quite a difference whether letter A or letter B is added, depending on what activities they represent.
In summary, we aim to modify the Needleman–Wunsch algorithm to allow for more uniqueness in the values to achieve better clustering results, through the addition of context provided by experts. The process for this is explained in more detail in Section 5.
5. Modified Needleman–Wunsch algorithm
This section discusses the development of the Modified Needleman–Wunsch algorithm to achieve adding uniqueness and context to the comparisons.
The Needleman–Wunsch metric was chosen as the base for this modification, as it had the greatest potential to modify the calculation in a meaningful way. As the intention for this modified metric was to be applied to clustering process data, three fundamental properties need to be preserved: (1) a point to itself receives a score of 0, (2) symmetry must hold and (3) a smaller value is indicative of a closer match. This will be addressed in the discussion concerning penalty values.
5.1. Variables
The first modification considered is the idea that not all activities should be allowed to swap with each other. This is because, considering the pathway from a resource planning perspective and the interaction between multiple care centres, allowing all activities to swap could lead to very different pathways being considered similar. For example, from a resource perspective, an X-ray under primary care supervision is very different from an MDT meeting consisting of multiple personnel from the secondary and tertiary centres.
To allow for this, a no-swap variable (ns) needs to be defined. Furthermore, the algorithm needs to be able to decipher which activities are allowed to swap with each other. This leads to the introduction of groups of activities, where essentially, if activities are in the same group then they are allowed to swap.
5.2. Groupings
The experts will be asked to group activities that happen at similar points in the pathway into the same group. It should be explained that the purpose of these groups is that if two patients performed different activities at the same point in their pathway, but these activities are in the same group, then they would be seen as more similar to each other than if the activities were in different groups. An example of the groupings used for the case study is provided in Table 3.
This permits greater meaning to be given to the pathways, however this does not lead to the values being more unique. This is addressed by using weightings and is discussed in the next section.
Table 3.
Grouping assignments for each activity.
| Group | Activities |
|---|---|
| 0 | A,B,C,O |
| 1 | D |
| 2 | E,F |
| 3 | G,H |
| 4 | I,J |
| 5 | K,L,N |
| 6 | M |
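The groupings in Table 3 can be held as a simple lookup so that the algorithm can decide whether two activities are allowed to swap. The sketch below is illustrative; the names `GROUPS`, `GROUP_OF` and `can_swap` are assumptions.

```python
# Group number -> activities, copied from Table 3.
GROUPS = {0: "ABCO", 1: "D", 2: "EF", 3: "GH", 4: "IJ", 5: "KLN", 6: "M"}

# Inverted lookup: activity letter -> group number.
GROUP_OF = {activity: group for group, activities in GROUPS.items() for activity in activities}


def can_swap(x: str, y: str) -> bool:
    """Two activities may swap only if they belong to the same group."""
    return GROUP_OF[x] == GROUP_OF[y]


print(can_swap("B", "O"))  # True: both are in group 0
print(can_swap("A", "D"))  # False: groups 0 and 1
```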
5.3. Weightings
The inclusion of weightings in the algorithm increases its complexity, and as such the distance becomes more difficult to calculate by hand.
We first discuss how to assign the weightings to the activities, and then follow with combining these into the algorithm.
Assume that domain experts (e.g. consultants in cancer services) are asked to rank the activities from most to least important (0 to N-1, where N is the number of activities). One way to think of this is that the activity that occurs most often is seen as most important, and thus ranked 0, while activities that occur more rarely are ranked as less important. These rankings are then converted into weightings, where the least important activity is assigned a weight of 1 and each step up in importance adds an increment of 1/(N-1). This subsequently gives the most important activity a weight of 2.
For example, Table 4 shows the rankings and resulting weightings (rounded to 3 d.p.) that were applied to the case study activities.
Table 4.
Ranking and weighting results for each activity.
| Activity | Rank | Weighting |
|---|---|---|
| A | 2 | 1.857 |
| B | 0 | 2.0 |
| C | 1 | 1.929 |
| D | 12 | 1.143 |
| E | 10 | 1.286 |
| F | 9 | 1.357 |
| G | 7 | 1.5 |
| H | 6 | 1.571 |
| I | 13 | 1.071 |
| J | 14 | 1.0 |
| K | 3 | 1.786 |
| L | 5 | 1.643 |
| M | 8 | 1.429 |
| N | 11 | 1.214 |
| O | 4 | 1.714 |
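The conversion from rankings to weightings can be reproduced with a short Python sketch. It assumes at least two activities and uses the closed form 1 + (N − 1 − rank)/(N − 1), which is equivalent to starting at 1 for the least important activity and adding 1/(N − 1) per step; the names `RANKS` and `weights_from_ranks` are illustrative.

```python
# Expert rankings from Table 4 (0 = most important, 14 = least important).
RANKS = {
    "A": 2, "B": 0, "C": 1, "D": 12, "E": 10, "F": 9, "G": 7, "H": 6,
    "I": 13, "J": 14, "K": 3, "L": 5, "M": 8, "N": 11, "O": 4,
}


def weights_from_ranks(ranks: dict) -> dict:
    """Map rank 0 to weight 2 and rank N-1 to weight 1, in equal steps."""
    n = len(ranks)  # assumes n >= 2
    return {activity: 1 + (n - 1 - rank) / (n - 1) for activity, rank in ranks.items()}


weights = weights_from_ranks(RANKS)
print(round(weights["A"], 3))  # 1.857
print(weights["B"])            # 2.0
print(weights["J"])            # 1.0
```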
5.4. Equations
As we have now defined both the groupings and weightings, we can combine these into the algorithm. We will first methodically work through the equations, including explanations, and then provide the pseudo-code.
Firstly, the match equation is as follows:
| (13) |
The match equation had to be modified using multiplication of the m parameter, to allow the initial 0 to propagate through. This is the main element that allows for a point to itself to be 0 (as required by the fundamental properties of metrics introduced in the beginning of Section 5).
The inclusion of the previous matrix value (X[i-1][j-1]) is required in the denominator to control the magnitude, and ensure that the penalty value for a match will not exceed 1.
Furthermore, as a match is a positive event, we needed to ensure that in this case, a more important activity has a smaller impact than a lesser important activity. This is the reason for the 1 over weight.
Moving on to the swap equation:
$$D = X[i-1][j-1] + s + \left| w_a - w_b \right| \tag{14}$$
where $w_a$ and $w_b$ are the weightings of the two activities being compared. This is more intuitive, as the modification is the addition of the absolute difference of the two weightings. As a result, activities that are allowed to swap but are ranked further apart will have a larger value than those that are ranked closer.
Now considering the no-swap equation:
| (15) |
This ensures that the no swap value is large enough to never get chosen in the matrix.
The gap equations are only slightly modified through the addition of the corresponding weighting of that direction:
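$$L = X[i][j-1] + g + w_L \tag{16}$$
$$T = X[i-1][j] + g + w_T \tag{17}$$
where $w_L$ and $w_T$ are the weightings of the activity that is aligned with a gap in the left and top directions respectively.
The final modification from the Needleman–Wunsch algorithm is that we now select the minimum of D, L and T, as opposed to the maximum. Algorithm 4 displays the pseudo-code for the modified Needleman–Wunsch algorithm.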
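The swap and gap penalties described in words above can be sketched as two small Python functions. This is only a partial illustration: it covers the terms added to the previous matrix value for an allowed swap (s plus the absolute difference of the two weightings) and for a gap (g plus the weighting of the gapped activity), and deliberately leaves out the match and no-swap terms. The function names and the three-activity `WEIGHTS` subset are assumptions.

```python
# Weightings for three activities, copied from Table 4 (rounded to 3 d.p.).
WEIGHTS = {"B": 2.0, "O": 1.714, "C": 1.929}


def swap_penalty(x: str, y: str, s: float, weights: dict = WEIGHTS) -> float:
    """Penalty added for swapping two activities that share a group."""
    return s + abs(weights[x] - weights[y])


def gap_penalty(x: str, g: float, weights: dict = WEIGHTS) -> float:
    """Penalty added for aligning activity x with a gap."""
    return g + weights[x]


# Using s = 2 and g = 2:
print(round(swap_penalty("B", "O", s=2), 3))  # 2.286
print(round(gap_penalty("O", g=2), 3))        # 3.714
```

Activities ranked closer together give a smaller swap penalty, and a gap against a more important activity costs more than one against a less important activity.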
5.5. Penalty values
In the literature surrounding the Needleman–Wunsch algorithm, it is often discussed that the user can specify the values for the match, swap and gap penalty, however there are no guidelines surrounding these.
We developed the following equations as guidelines, to ensure that the order of preference (match, then swap, then gap, then no-swap) holds when choosing values for the variables.
For further clarification, m must be set to 1, as the match equation involves a multiplication and otherwise the factor is not consistently less than 1 (more clarification below). Moreover, it is unnecessary for ns to be larger than 2g + 1, as this is sufficient to consistently force gaps when a no-swap is necessary.
As a result, the smallest possible penalty values are: , , , ns.
As with the standard Needleman–Wunsch algorithm, changes to the penalty values will result in different distances being calculated, which will propagate through to the clustering. Advice to the user when selecting the values of s and g in particular is to select values with a larger difference between s and g, to ensure a more distinct separation of these two actions.
5.6. Example
Fig. 12 calculates the modified Needleman–Wunsch distance between the two pathways ABKOGNCH and ABC, using the values m = 1, g = 2, s = 2 and ns = 5, with the groupings and weightings from Table 3 and Table 4 respectively.
Fig. 12.
Example of modified dynamic programming algorithm.
Fig. 13 shows the resulting alignment from following the traceback.
Fig. 13.

Example of modified traceback.
Consider that intuitively it should always be better to take a swap over a gap. However, looking at the interaction between B and O, it can be seen that this is not the case, as the value from the gap is smaller than that of the allowed swap. At first glance, this may seem incorrect, until further inspection when it is clear that this is necessary to allow the alignment of B with itself two steps later.
This demonstrates the intelligence of the algorithm, and the consideration for the string as a whole during traceback.
5.7. Features
The modified algorithm allows for many features to be considered, which are as follows:
1. Point to itself is 0
2. The distance score for the string is 0 until the first non-match (similar to the common prefix idea in the Jaro–Winkler distance)
3. Distances between two pathways are commutative
4. Matches between higher importance activities produce a smaller distance
5. A match earlier in the string will result in a smaller value than one appearing later
6. Gaps with higher importance activities have a larger value than those with lower importance
7. Swaps of activities that are closer in terms of rankings will produce a smaller value
Fig. A.16 (Appendix) displays all the features described above for Sample 2 (explained below) using penalty values m = 1, g = 2, s = 2, ns = 5.
Fig. A.16.

Modified Needleman–Wunsch Distance Matrix for Sample 2.
To add commentary to Fig. A.16 (Appendix), feature 1 is displayed along the diagonal of the matrix, and feature 3 (commutativity) is displayed, and thus one can ignore the bottom diagonal of the matrix, and just examine the top diagonal.
Feature 2 can be confirmed by matrix locations (1,2), (1,3) and (1,4), as the value corresponds to with the addition of the weight for the additional letter as displayed in Table 4. These three values also confirm feature 6.
Features 4 and 7 are also displayed in Fig. A.16 (Appendix), but can easily be checked manually by combining the weightings in Table 4 with the equations for the match and swap (Eqs. (13), (14)) respectively.
Feature 5 is the most complex and a by-product of feature 1. This feature arises because the match penalty calculation is a factor of the previous value (as previously discussed in the context of Eq. (13)). This feature can be seen by comparing matrix locations (1,2) and (1,5), where (1,2) is smaller than (1,5) because the match of C happens earlier in (1,2) than in (1,5). To further display this feature, consider the string C compared with the following three strings: (1) DC, (2) HC, and (3) DHC. Fig. 14 shows the full calculation matrix of each of the three scenarios. The impact of matching C in each scenario can be calculated as the difference between the two values indicated by the diagonal arrow in Fig. 14, as follows:
| (18) |
Fig. 14.
Example of feature five.
Eq. (18) shows that the penalty for matching C is different in all three scenarios. Put simply, the larger the previous value, the larger the effect of matching C. Hence, the later a match appears in the string, the larger the value.
In conclusion, the modified Needleman–Wunsch algorithm does produce a more specific value for distance, considering length, position, and sequence, whilst also considering the weightings and groupings of the activities.
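Features 1 and 3 can be checked mechanically for any candidate distance function, and together they mean that only the upper triangle of a distance matrix needs to be computed. The following is a minimal sketch of that idea, not the implementation used in this paper: a unit-cost Levenshtein distance stands in for the modified algorithm, and the helper names `has_features_1_and_3` and `upper_triangle_matrix` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, used here only as a stand-in metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def has_features_1_and_3(pathways: Sequence[str],
                         dist: Callable[[str, str], float]) -> bool:
    """Feature 1: zero self-distance. Feature 3: dist(p, q) == dist(q, p)."""
    zero_diagonal = all(dist(p, p) == 0 for p in pathways)
    symmetric = all(dist(p, q) == dist(q, p)
                    for p, q in combinations(pathways, 2))
    return zero_diagonal and symmetric


def upper_triangle_matrix(pathways: Sequence[str],
                          dist: Callable[[str, str], float]) -> List[List[float]]:
    """Compute each pair once and mirror it, relying on features 1 and 3."""
    n = len(pathways)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at 0
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = float(dist(pathways[i], pathways[j]))
    return matrix


pathways = ["ABC", "ABCK", "ABKC", "DIJ"]
print(has_features_1_and_3(pathways, levenshtein))
matrix = upper_triangle_matrix(pathways, levenshtein)
for row in matrix:
    print(row)
```

Any distance function with the same two-argument signature could be passed in place of `levenshtein`, so the same check applies to the modified metric.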
6. Case studies
Our research applies the eight previously discussed metrics and the modified algorithm to two small samples and the full case study dataset. These samples are very basic to allow the reader to closely examine the intricate differences that appear due to the inclusion of the weighting and rankings. Furthermore, sample 1 and sample 2 are easily assigned to two and three groups respectively, to display that the obvious solution is found in a simple example, and to provide the reader with confidence when applying this to more complex data. Although these samples are artificially constructed, they reflect the small differences between strings seen in practice.
Sample 1 consists of 10 pathways: ABC, ABCK, ABCL, ABCO, ABKC, DIJ, DIJK, DIJL, DIJO, and DIKJ. These were chosen as A,B,C and D,I,J are the highest and lowest ranked activities respectively.
Sample 2 consists of 16 pathways, the same 10 as in sample 1, plus a further six which display the complexity of allowed swaps between slight differences within the pathway. These are: ‘ABKOCEF’, ‘ABOKCEF’, ‘ABKOCFE’, ‘ABOKCFE’, ‘ABKECOF’, ‘ABKCOEF’.
Two examples of the modification are included in the analysis using penalty values g = 2, s = 2, ns = 5 and g = 9, s = 2, ns = 19, which will be referred to as MNW_1225 and MNW_19219 respectively.
The analysis for the two samples is as follows. Firstly, the distances between all the points are calculated using the ten metrics (the eight previously discussed and the two examples of the modification), and then plotted to demonstrate how the modified algorithm allows for more separation in the data. Secondly, the k-medoids clustering is run for k ∈ [2, 8], where the silhouette scores both confirm the first point and display that the modified algorithm outperforms most of the other metrics. The findings are displayed in a table, which contains the results for k = 2 and then the best performing k (if k = 2 was best, then the second best is displayed), and which includes the number of iterations.
The following Python libraries were used: textdistance [105] for the calculation of the eight other distance metrics, pyclustering [115] for the k-medoids clustering, and scikit-learn [116] for the calculation of the silhouette score.
6.1. Sample 1: 10 pathways
Fig. A.17 (Appendix) displays a comparison of the distances between the pathways in sample 1 for each of the eight measures discussed in Section 4 and the two examples of the modified algorithm (MNW_1225 and MNW_19219).
To aid understanding of Fig. A.17 (Appendix), firstly the distance from each point to itself is 0, and therefore the colour of the dot at x = 0 for each pathway on y is the colour that represents that pathway, e.g. pathway DIJ is represented by the red dot. Furthermore, all pathways beginning with A are from the blue colour palette, and those beginning with D are from the red colour palette.
The y-axis displays the pathway which all others are being compared to and the x-axis displays the distance from that pathway. For example, in the top left graph considering the Levenshtein distance, the distance from ABC (light blue) to ABKC (dark green) is 1.
In all eight of these graphs in Fig. A.17 (Appendix), if you split the graph horizontally between ABKC and DIJ and overlaid the two halves, you would see that the distances are exactly the same, which reflects the lack of uniqueness. There is also little separation between the blue and red groups, with the exception of the Jaro and Jaro–Winkler graphs, where the separation is clearer.
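The mirrored halves and the lack of uniqueness can be illustrated numerically for the Levenshtein distance. The sketch below is a plain-Python re-implementation for illustration only (the paper's calculations used the textdistance library); it checks that each pathway in the first half of sample 1 has the same set of distances as its counterpart in the second half, and lists the distinct values that occur.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


SAMPLE_1 = ["ABC", "ABCK", "ABCL", "ABCO", "ABKC",
            "DIJ", "DIJK", "DIJL", "DIJO", "DIKJ"]

# Full pairwise distance matrix for sample 1.
matrix = [[levenshtein(p, q) for q in SAMPLE_1] for p in SAMPLE_1]

# Pathway i (A group) and pathway i + 5 (D group) see the same distances.
half = len(SAMPLE_1) // 2
mirrored = all(sorted(matrix[i]) == sorted(matrix[i + half])
               for i in range(half))

# Only a handful of distinct values are available to separate 10 pathways.
distinct_values = sorted({d for row in matrix for d in row})

print(mirrored)         # True
print(distinct_values)  # [0, 1, 2, 3, 4]
```

With only five distinct values shared across both groups, many pairs of pathways are tied, which is the behaviour the modified algorithm is intended to avoid.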
Now consider the bottom two graphs in Fig. A.17 (Appendix), which display the modified algorithm (penalty values g = 2, s = 2 and ns = 5 on the left, and g = 9, s = 2 and ns = 19 on the right). This algorithm clearly allows for more uniqueness and greater separation between the colour groups, as desired.
To confirm that this is reflected in the clustering, k-medoids clustering was performed for all ten metrics, the results for which are displayed in Table 5. The initial centroids were chosen as 0: ‘ABC’ and 5: ‘DIJ’. It is expected that the clustering algorithm should keep ‘ABC’ and ‘DIJ’ as the centroids.
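As an illustration of this expectation, the following sketch runs a simple alternating k-medoids (assign each pathway to its nearest medoid, then re-select each medoid as the member with the smallest total distance to its cluster) on sample 1 with the Levenshtein distance. This is an assumption-laden stand-in: it is not the pyclustering implementation used for the results, and the function name `k_medoids` and its tie-breaking (first minimum wins) are choices made here.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def k_medoids(matrix: Sequence[Sequence[float]],
              initial_medoids: Sequence[int],
              max_iter: int = 100) -> Tuple[List[int], List[List[int]]]:
    """Alternating k-medoids on a precomputed distance matrix."""
    medoids = list(initial_medoids)
    clusters: List[List[int]] = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = [[] for _ in medoids]
        for point in range(len(matrix)):
            nearest = min(range(len(medoids)),
                          key=lambda c: matrix[point][medoids[c]])
            clusters[nearest].append(point)
        # Update step: the new medoid minimises total distance within a cluster.
        updated = [
            min(members, key=lambda m: sum(matrix[m][o] for o in members))
            if members else medoids[c]  # keep the old medoid for an empty cluster
            for c, members in enumerate(clusters)
        ]
        if updated == medoids:
            break  # converged: medoids did not move
        medoids = updated
    return medoids, clusters


SAMPLE_1 = ["ABC", "ABCK", "ABCL", "ABCO", "ABKC",
            "DIJ", "DIJK", "DIJL", "DIJO", "DIKJ"]
matrix = [[levenshtein(p, q) for q in SAMPLE_1] for p in SAMPLE_1]

medoids, clusters = k_medoids(matrix, initial_medoids=[0, 5])
print([SAMPLE_1[m] for m in medoids])  # ['ABC', 'DIJ']
print([len(c) for c in clusters])      # [5, 5]
```

In this sketch the medoids remain at 'ABC' and 'DIJ' and each cluster holds five pathways, in line with the expectation stated above.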
Table 5 displays the expected results: the only measures that surpass the modified Needleman–Wunsch in silhouette score are Jaro and Jaro–Winkler.
Table 5.
Clustering of Sample 1, for all ten distances.
| Name | Centroids | Number per cluster | Silhouette score |
|---|---|---|---|
| Levenshtein | 0, 5 | 5, 5 | 0.65789 |
| Damerau–Levenshtein | 0, 5 | 5, 5 | 0.70614 |
| Jaro | 0, 5 | 5, 5 | 0.85602 |
| Jaro–Winkler | 0, 5 | 5, 5 | 0.88333 |
| Needleman–Wunsch | 0, 5 | 5, 5 | 0.65789 |
| Jaccard | 0, 5 | 5, 5 | 0.43500 |
| Cosine | 0, 5 | 5, 5 | 0.58577 |
| LCS | 0, 5 | 5, 5 | 0.73099 |
| MNW_1225 | 0, 5 | 5, 5 | 0.76128 |
| MNW_19219 | 0, 5 | 5, 5 | 0.77464 |
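To make the silhouette score in Table 5 concrete, the sketch below computes the mean silhouette directly from a precomputed distance matrix, using the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)). It is a plain-Python illustration rather than the scikit-learn routine used for the reported results, and the function name `silhouette_from_matrix` is hypothetical. Applied to sample 1 with the Levenshtein distance and the two expected clusters, it gives the value reported in the Levenshtein row.

```python
from typing import Dict, List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def silhouette_from_matrix(matrix: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> float:
    """Mean silhouette over all points; needs at least two clusters."""
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(matrix[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(matrix[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


SAMPLE_1 = ["ABC", "ABCK", "ABCL", "ABCO", "ABKC",
            "DIJ", "DIJK", "DIJL", "DIJO", "DIKJ"]
matrix = [[levenshtein(p, q) for q in SAMPLE_1] for p in SAMPLE_1]
labels = [0] * 5 + [1] * 5  # the two expected clusters

score = silhouette_from_matrix(matrix, labels)
print(round(score, 5))  # 0.65789
```

For example, ABC has a mean within-cluster distance of 1 and a mean distance of 3.8 to the D group, giving a silhouette of 2.8/3.8 ≈ 0.737 for that pathway.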
6.2. Sample 2: 16 pathways
Similarly to the previous subsection, Fig. A.18 (Appendix) displays a comparison of the distances between the pathways in sample 2 for each of the eight measures discussed in Section 4 and the two examples of the modified algorithm (MNW_1225 and MNW_19219).
In this sample, it is logical to assume that three clusters would be appropriate, the same two as in sample 1 and a further one containing the extra six pathways. Therefore Fig. A.18 (Appendix) should be examined for the appearance of three distinct groups.
This is actually not as clear cut as it was with sample 1 (in relation to two groups). In the majority of the metrics, it is difficult to find the clear groups one is expecting (one group of red, one group of blue and another of yellow). Again the distinction is clearer in the modified algorithm, especially with the penalty values g = 9, s = 2 and ns = 19 (as previously stated). This further confirms that the modified algorithm allows for better distinction between pathways.
To confirm whether this is reflected in the clustering, the same analysis was run as that described for sample 1, with the initial centroids chosen as 0: ‘ABC’, 5: ‘DIJ’ and 10: ‘ABKOCEF’, and for k ∈ [2, 3]. It is expected that the clustering algorithm should keep the same centroids, and that three clusters would be chosen.
Table 6 confirms that the modified algorithm performs as well as the other metrics, and selects the expected centroids, which is not the case with some of the other metrics.
Table 6.
Clustering of Sample 2, for all ten distances.
| Name | Centroids (k = 2) | Number per cluster (k = 2) | Silhouette score (k = 2) | Centroids (k = 3) | Number per cluster (k = 3) | Silhouette score (k = 3) |
|---|---|---|---|---|---|---|
| Levenshtein | 4, 5 | 11, 5 | 0.51433 | 0, 5, 10 | 6, 5, 5 | 0.45234 |
| Damerau–Levenshtein | 4, 5 | 11, 5 | 0.54800 | 0, 5, 10 | 5, 5, 6 | 0.62230 |
| Jaro | 5, 10 | 5, 11 | 0.80971 | 0, 5, 10 | 5, 5, 6 | 0.58120 |
| Jaro–Winkler | 4, 5 | 11, 5 | 0.84600 | 0, 5, 10 | 5, 5, 6 | 0.60252 |
| Needleman–Wunsch | 4, 5 | 11, 5 | 0.50676 | 0, 5, 10 | 6, 5, 5 | 0.43148 |
| Jaccard | 0, 5 | 11, 5 | 0.30516 | 0, 4, 5 | 4, 7, 5 | 0.32543 |
| Cosine | 0, 5 | 11, 5 | 0.44807 | 0, 4, 5 | 4, 7, 5 | 0.43025 |
| LCS | 4, 5 | 11, 5 | 0.56353 | 0, 5, 10 | 5, 5, 6 | 0.67356 |
| MNW_1225 | 4, 5 | 11, 5 | 0.64700 | 0, 5, 10 | 5, 5, 6 | 0.53059 |
| MNW_19219 | 4, 5 | 11, 5 | 0.67874 | 0, 5, 10 | 5, 5, 6 | 0.59195 |
It was expected that three clusters should be chosen. However, examining the silhouette scores, it appears that in most cases the score for k = 2 is closer to 1 than the score for k = 3, suggesting that two clusters is better. This indicates that the silhouette score is possibly not the most appropriate measure to use, and that care is needed when selecting the appropriate number of clusters.
In conclusion, both samples display that the modified algorithm does enhance the differences between strings based on user-specific characteristics, and performs as well as, if not better than, the currently used metrics.
6.3. Full data
This section applies the eight measures discussed in Section 4 and the two examples of the modified algorithm (MNW_1225 and MNW_19219) to the full data set which was discussed in Section 3. As a recap, there are 2350 patients and 1019 different pathways considering the 15 activities. We have applied k-medoids clustering to the data, considering values of k ∈ [2, 8] and initial centroids as [0,1,2,3,4,5,6,7].
Table 7 shows the results for k = 2 and Table 8 for the (next) best value of k (in terms of silhouette score). Both tables also include the medoids that were chosen and the number of pathways assigned to each of those cluster medoids.
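The procedure of scanning k and reporting both k = 2 and the next best k can be sketched as follows. This is an illustrative stand-in under several assumptions: the full data set is not reproduced here, so sample 2 is used instead; the Levenshtein distance, a simple alternating k-medoids and a plain silhouette calculation replace the libraries used for the reported results; and the first k indices are assumed to be the initial medoids for each k. No particular scores are claimed.

```python
from typing import Dict, List, Sequence


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def assign(matrix: Sequence[Sequence[float]], medoids: Sequence[int]) -> List[int]:
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: matrix[p][medoids[c]])
            for p in range(len(matrix))]


def k_medoids_labels(matrix, medoids: Sequence[int], max_iter: int = 100) -> List[int]:
    """Alternating k-medoids; returns the final cluster label of each point."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(matrix, medoids)
        updated = []
        for c in range(len(medoids)):
            members = [p for p, lab in enumerate(labels) if lab == c]
            updated.append(min(members, key=lambda m: sum(matrix[m][o] for o in members))
                           if members else medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return assign(matrix, medoids)


def silhouette(matrix, labels: Sequence[int]) -> float:
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0
        a = sum(matrix[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(matrix[i][j] for j in mem) / len(mem)
                for other, mem in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


SAMPLE_2 = ["ABC", "ABCK", "ABCL", "ABCO", "ABKC",
            "DIJ", "DIJK", "DIJL", "DIJO", "DIKJ",
            "ABKOCEF", "ABOKCEF", "ABKOCFE", "ABOKCFE", "ABKECOF", "ABKCOEF"]
matrix = [[levenshtein(p, q) for q in SAMPLE_2] for p in SAMPLE_2]

# Assumption: the first k pathways act as the initial medoids for each k.
scores = {k: silhouette(matrix, k_medoids_labels(matrix, range(k)))
          for k in range(2, 9)}
ranked = sorted(scores, key=scores.get, reverse=True)
next_best_k = next(k for k in ranked if k != 2)  # best k other than k = 2
print(scores[2], next_best_k, scores[next_best_k])
```

Because the result of k-medoids depends on the initial medoids and on how ties are broken, the scores from this sketch need not match those of any particular library.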
Table 7.
Results of full data clustering for k = 2.
| Name | Iter | Medoids | Pathways per cluster | Score |
|---|---|---|---|---|
| Levenshtein | 3 | KAOBC, AKBMCEGFH | 663, 356 | 0.15604 |
| Damerau–Levenshtein | 2 | KAOBCD, AKBMCEGFH | 676, 343 | 0.17549 |
| Jaro | 3 | KAOBLCD, AKOBMCEGFH | 409, 610 | 0.18343 |
| Jaro–Winkler | 3 | KAOBCD, AKOBMCEGFH | 445, 574 | 0.17542 |
| Needleman–Wunsch | 2 | AOBC, AOBCEGFH | 727, 292 | 0.16743 |
| Jaccard | 2 | KAOBNLCGH, KAOBMCEGFH | 650, 369 | 0.04297 |
| Cosine (a) | 2 | KAOBNLCGDH, KAOBMCEFGH | 649, 369 | 0.06854 |
| LCS | 2 | KAOBCD, KAOBCEGFH | 510, 509 | 0.24305 |
| MNW_1225 | 2 | KABC, AOBCEGFH | 715, 304 | 0.14303 |
| MNW_19219 | 2 | AOBC, AKOBCEGFH | 676, 343 | 0.17976 |
(a) For cosine, the pathway consisting of just activity B had to be removed, as it caused division by 0.
Table 8.
Results of full data clustering for best k (excluding k = 2).
| Name | k | Iter | Medoids | Pathways per cluster | Score |
|---|---|---|---|---|---|
| Levenshtein | 3 | 4 | KAOBC, AKBMCEGFH, ABCO | 541, 348, 130 | 0.06964 |
| Damerau–Levenshtein | 3 | 4 | KAOBCD, AKBMCEGFH, ABKOC | 519, 333, 167 | 0.09724 |
| Jaro | 3 | 3 | KAOBLCD, KAOBMCEGFH, ABKOC | 315, 503, 201 | 0.16252 |
| Jaro–Winkler | 3 | 3 | KAOBLCD, KAOBMCEGFH, ABKOC | 308, 487, 224 | 0.16254 |
| Needleman–Wunsch | 3 | 2 | AOBC, AOBCEGFH, ABCO | 582, 279, 158 | 0.06689 |
| Jaccard | 7 | 3 | KAOBNLC, KAOBMCEGFH, KAOBC, AKOBNC, KABNCOEF, AOKBMC, BKAOCGH | 137, 229, 117, 194, 84, 113, 145 | 0.05322 |
| Cosine (a) | 7 | 4 | KANOMBCEFD, KAOBMCEFGH, ABC, AKOBC, KABMCO, AOKBC, BKAOCEGFH | 172, 219, 45, 207, 122, 89, 164 | 0.08812 |
| LCS | 3 | 3 | KAOBCD, KAOBCEGFH, ABKC | 408, 509, 102 | 0.14132 |
| MNW_1225 | 3 | 4 | KAOBC, AOBCEGFH, AKBC | 384, 199, 436 | 0.13354 |
| MNW_19219 | 3 | 4 | AKOBC, AOKBCEGFH, KAOBC | 403, 229, 387 | 0.14860 |
(a) For cosine, the pathway consisting of just activity B had to be removed, as it caused division by 0.
The run time for each distance matrix was under 10 min, where the modified Needleman–Wunsch algorithm performed within the range of the other metrics.
Table 7 and Table 8 both show that the silhouette scores for all ten measures are quite poor. However, the silhouette score for the Needleman–Wunsch modification, with both sets of penalty values, is on par with the other metrics for k = 2 (with the exception of LCS), and surpasses most of the other measures, with the exception of the Jaro metrics, when considering the second best value for k. This shows that for a full dataset, the modification performs as well as, if not better than, the frequently used metrics when considering the silhouette score.
Furthermore, the metrics as a whole do not come to a consensus on a solution for the clustering, as each of the metrics produces different results when considering the centroids selected and the number of pathways assigned to each cluster. Even when the same medoids are selected, the number of pathways assigned to those medoids' clusters is not the same. This confirms that careful consideration is needed when selecting the distance metric, and what differences are to be highlighted.
7. Conclusions
A recent review of the literature [5] highlighted that clustering is a popular method for pathway discovery; however, the distance metrics that apply to string data are lacking in uniqueness and do not hold any context. The review [5] also highlighted the lack of techniques that consider information gathered from data and from experts together when developing a clinical pathway.
As a result, this paper discusses the development of a new distance metric, modified from the Needleman–Wunsch dynamic programming algorithm, that is specifically designed for clustering, and allows for expert interaction through the use of groupings and rankings of activities.
The modified metric was compared against eight other popular metrics, where it performed equally well, if not better, when used with k-medoids clustering. This comparison further highlights that each of the metrics produces different results and, as such, confirms the hypothesis that careful consideration is needed when selecting a string metric.
Care needs to be taken when selecting the penalty values along with the rankings and groupings, as the values selected will change the results produced by the clustering. Further work could be considered to aid the user in selecting these values most effectively.
This method can support clinical pathway redesign or optimisation by initially providing a more time efficient process for mapping clinical pathways through combining both data and expert knowledge. As a result of combining both data and expert knowledge the clusters should be more clinically relevant using the modified Needleman–Wunsch metric due to the rankings and groupings feature.
From a clinical perspective, the resulting clusters enable deeper examination of the activity interactions, which can help to highlight patterns that were previously undetectable when looking at the data as a whole. This can support decision makers in the pathway redesign process, which could lead to reduced delays to diagnosis and improved outcomes. This can also allow decision makers to prospectively consider the capacity required at activities due to an awareness of preceding activity demand.
To further facilitate the use of this method, the modified algorithm (including the rankings and groupings feature) has been built into a decision support tool, Sim.Pro.Flow, which is available open access on GitHub [117]. Sim.Pro.Flow supports further exploration of the resulting clusters by allowing visualisation of the pathways as a network and by allowing the pathways to be explored through a discrete event simulation.
Overall, the modified metric paves the way to adding more context to string distances, and bridges the gap between data and expert interaction.
Further work
The following areas have been identified as further work:
• Smart selection of penalty values: Machine learning techniques could be utilised to select penalty values which highlight various relationships as appropriate.
• Modify the Jaro distance metric [108], [109] using the same idea, as it produces good silhouette scores.
• Consider a final adjustment to the modified value to account for the total number of letters that appear in both strings i.e. divide final value by the number of letters appearing in both.
• Further sensitivity analysis to aid guidance in selecting penalty values, rankings and groups.
• Investigation of the impact of allowing groupings of singular activities, and how this could be used effectively.
CRediT authorship contribution statement
Emma Aspland: Methodology, Formal analysis, Software, Writing - original draft. Paul R. Harper: Conceptualisation, Supervision, Writing - review & editing. Daniel Gartner: Conceptualisation, Supervision, Writing - review & editing. Philip Webb: Data curation, Supervision, Writing - review & editing. Peter Barrett-Lee: Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to thank Velindre Cancer Centre for supporting this work in many ways. The authors would like to specifically acknowledge Nikoleta Glynatsi, Geraint Palmer and Henry Wilde for their support with coding. Furthermore, the authors sincerely thank the associate editor and the anonymous referees for their careful review and excellent suggestions for improvement of this paper.
Funding
This work has resulted from research funded by a Cancer Research UK grant ‘Analysis and Modelling of a Single Cancer Pathway Diagnostics’ (Early Diagnosis Project Award A27882) and from a KESS2 grant under the project title “Smart Simulation and Modelling of Complex Cancer Systems”. Knowledge Economy Skills Scholarships (KESS) is a pan-Wales higher level skills initiative led by Bangor University on behalf of the HE sector in Wales. It is part funded by the Welsh Government’s European Social Fund (ESF) convergence programme for West Wales and the Valleys.
Appendix.
References
- 1. World Health Organisation. 2018. Latest global cancer data. https://www.who.int/cancer/PRGlobocanFinal.pdf.
- 2. Snyder M. Big data and health. Lancet Digit. Health. 2019;1(6):e252–e254. doi: 10.1016/S2589-7500(19)30109-8.
- 3. Zhang Y., Padman R., Patel N. Paving the cowpath: Learning and visualizing clinical pathways from electronic health record data. J. Biomed. Inform. 2015;58:186–197. doi: 10.1016/j.jbi.2015.09.009.
- 4. Fauman M. Do physicians use practice guidelines? Psychiatr. Times. 2006:13.
- 5. Aspland E.L., Gartner D., Harper P.R. Clinical pathway modelling: A literature. Health Syst. 2019 doi: 10.1080/20476965.2019.1652547.
- 6. A. Novikov, PyClustering: Data Mining Library, J. Open Source Softw. (2019).
- 7. Vogt V., Scholz S.M., Sundmacher L. Applying sequence clustering techniques to explore practice-based ambulatory care pathways in insurance claims data. Eur. J. Public Health. 2018;28(2):214–219. doi: 10.1093/eurpub/ckx169.
- 8. Chen J., Sun L., Guo C., Wei W., Xie Y. A data-driven framework of typical treatment process extraction and evaluation. J. Biomed. Inform. 2018;83:178–195. doi: 10.1016/j.jbi.2018.06.004.
- 9. Deja R., Froelich W., Deja G., Wakulicz-Deja A. Hybrid approach to the generation of medical guidelines for insulin therapy for children. Inform. Sci. 2017;384:157–173.
- 10. Funkner A.A., Yakovlev A.N., Kovalchuk S.V. Vol. 119. 2017. Towards evolutionary discovery of typical clinical pathways in electronic health records; pp. 234–244.
- 11. Funkner A.A., Yakovlev A.N., Kovalchuk S.V. Vol. 121. 2017. Data-driven modeling of clinical pathways using electronic health records; pp. 835–842.
- 12. Guo S., Xu K., Zhao R., Gotz D., Zha H., Cao N. Eventthread: Visual summarization and stage analysis of event sequence data. IEEE Trans. Vis. Comput. Graphics. 2018;24(1):56–65. doi: 10.1109/TVCG.2017.2745320.
- 13. Kovalchuk S.V., Funkner A.A., Metsker O.G., Yakovlev A.N. Simulation of patient flow in multiple healthcare units using process and data mining techniques for model identification. J. Biomed. Inform. 2018;82:128–142. doi: 10.1016/j.jbi.2018.05.004.
- 14. Lakshmanan G.T., Rozsnyai S., Wang F. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 8094. 2013. Investigating clinical care pathways correlated with outcomes; pp. 323–338. (LNCS).
- 15. Lismont J., Janssens A.-S., Odnoletkova I., vanden Broucke S., Caron F., Vanthienen J. A guide for the application of analytics on healthcare processes: A dynamic view on patient pathways. Comput. Biol. Med. 2016;77:125–134. doi: 10.1016/j.compbiomed.2016.08.007.
- 16. Najjar A., Reinharz D., Girouard C., Gagné C. A two-step approach for mining patient treatment pathways in administrative healthcare databases. Artif. Intell. Med. 2018;87:34–48. doi: 10.1016/j.artmed.2018.03.004.
- 17. Shen C.-P., Jigjidsuren C., Dorjgochoo S., Chen C.-H., Chen W.-H., Hsu C.-K., Wu J.-M., Hsueh C.-W., Lai M.-S., Tan C.-T., Altangerel E., Lai F. A data-mining framework for transnational healthcare system. J. Med. Syst. 2012;36(4):2565–2575. doi: 10.1007/s10916-011-9729-7.
- 18. Tsumoto S., Iwata H., Hirano S., Tsumoto Y. Similarity-based behavior and process mining of medical practices. Future Gener. Comput. Syst. 2014;33:21–31.
- 19. Helbig K., Römer M., Mellouli T. Lecture Notes in Computer Science. vol. 9253. 2015. A clinical pathway mining approach to enable scheduling of hospital relocations and treatment services; pp. 242–250. (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
- 20. Huang Z., Dong W., Ji L., Yin L., Duan H. On local anomaly detection and analysis for clinical pathways. Artif. Intell. Med. 2015;65(3):167–177. doi: 10.1016/j.artmed.2015.09.001.
- 21. Huang Z., Gan C., Lu X., Huan H. Vol. 192. 2013. Mining the changes of medical behaviors for clinical pathways; pp. 117–121.
- 22. Michalowski W., Wilk S., Thijssen A., Li M. Using a Bayesian belief network model to categorize length of stay for radical prostatectomy patients: Using a Bayesian belief network to categorize LOS. Health Care Manag. Sci. 2006;9(4):341–348. doi: 10.1007/s10729-006-9998-8.
- 23. Zhang Y., Padman R. Data-driven clinical and cost pathways for chronic care delivery. Amer. J. Manag. Care. 2016;22(12):816–820.
- 24. Hira Z.M., Gillies D.F. Identifying significant features in cancer methylation data using gene pathway segmentation. Cancer Inform. 2016;15:189–198. doi: 10.4137/CIN.S39859.
- 25. Yin L., Dong W., Huang Z., Ji L., Lv X., Duan H. On detecting the changes of medical behaviors in clinical pathways. Chin. J. Biomed. Eng. 2015;34(3):272–280.
- 26. Xu X., Jin T., Wei Z., Lv C., Wang J. 2016. Tcpm: Topic-based clinical pathway mining; pp. 292–301.
- 27. Xu X., Jin T., Wei Z., Wang J. 2017. Incorporating domain knowledge into clinical goal discovering for clinical pathway mining; pp. 261–264.
- 28. Huang Z., Dong W., Bath P., Ji L., Duan H. On mining latent treatment patterns from electronic medical records. Data Min. Knowl. Discov. 2015;29(4):914–949.
- 29. Huang Z., Dong W., Ji L., He C., Duan H. Incorporating comorbidities into latent treatment pattern mining for clinical pathways. J. Biomed. Inform. 2016;59:227–239. doi: 10.1016/j.jbi.2015.12.012.
- 30. Huang Z., Dong W., Duan H., Li H. Similarity measure between patient traces for clinical pathway analysis: Problem, method, and applications. IEEE J. Biomed. Health Inf. 2014;18(1):4–14. doi: 10.1109/JBHI.2013.2274281.
- 31. Huang Z., Dong W., Ji L., Gan C., Lu X., Duan H. Discovery of clinical pathway patterns from event logs using probabilistic topic models. J. Biomed. Inform. 2014;47:39–57. doi: 10.1016/j.jbi.2013.09.003.
- 32. Huang Z., Dong W., Ji L., Duan H. Predictive monitoring of clinical pathways. Expert Syst. Appl. 2016;56:227–241.
- 33. Huang Z., Lu X., Duan H. Latent treatment pattern discovery for clinical processes. J. Med. Syst. 2013;37(2) doi: 10.1007/s10916-012-9915-2.
- 34. Huang Z., Lu X., Duan H. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 7885. 2013. Similarity measuring between patient traces for clinical pathway analysis; pp. 268–272. (LNAI).
- 35. Xu X., Jin T., Wei Z., Wang J. Incorporating topic assignment constraint and topic correlation limitation into clinical goal discovering for clinical pathway mining. J. Healthc. Eng. 2017;2017 doi: 10.1155/2017/5208072.
- 36. Yin L., Huang Z., Dong W., He C., Duan H. Utilizing electronic medical records to discover changing trends of medical behaviors over time. Methods Inf. Med. 2017;56(MethodsOpen):e49–e66. doi: 10.3414/ME16-01-0047.
- 37. Yang W.-S., Hwang S.-Y. A process-mining framework for the detection of healthcare fraud and abuse. Expert Syst. Appl. 2006;31(1):56–58.
- 38. Arnolds I.V., Gartner D. Improving hospital layout planning through clinical pathway mining. Ann. Oper. Res. 2018;263:453–477.
- 39. Dagliati A., Sacchi L., Zambelli A., Tibollo V., Pavesi L., Holmes J.H., Bellazzi R. Temporal electronic phenotyping by mining careflows of breast cancer patients. J. Biomed. Inform. 2017;66:136–147. doi: 10.1016/j.jbi.2016.12.012.
- 40. Gartner D., Arnolds I.V., Nickel S. Improving hospital-wide patient scheduling decisions by clinical pathway mining. Stud. Health Technol. Inform. 2015;216:1066.
- 41. Perer A., Wang F., Hu J. Mining and exploring care pathways from electronic medical records with visual analytics. J. Biomed. Inform. 2015;56:369–378. doi: 10.1016/j.jbi.2015.06.020.
- 42. Smedley N.F., Ellingson B.M., Cloughesy T.F., Hsu W. Longitudinal patterns in clinical and imaging measurements predict residual survival in glioblastoma patients. Sci. Rep. 2018;8(1) doi: 10.1038/s41598-018-32397-z.
- 43. Syed H., Das A.K. Lecture Notes in Computer Science. vol. 9105. 2015. Identifying chemotherapy regimens in electronic health record data using interval-encoded sequence alignment; pp. 143–147. (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
- 44. Tolarczyk A., Siwek K. 2016. Sequential pattern recognition for medical records analysis.
- 45. K. Uragaki, T. Hosaka, Y. Arahori, M. Kushima, T. Yamazaki, K. Araki, H. Yokota, Sequential pattern mining on electronic medical records with handling time intervals and the efficacy of medicines, Vol. 2016-August, 2016, pp. 20–25.
- 46. Dauxais Y., Guyet T., Gross-Amblard D., Happe A. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 10259. 2017. Discriminant chronicles mining: Application to care pathways analytics; pp. 234–244. (LNAI).
- 47. Li X., Liu H., Mei J., Yu Y., Xie G. Mining temporal and data constraints associated with outcomes for care pathways. Stud. Health Technol. Inform. 2015;216:711–715.
- 48. Caron F., Vanthienen J., Vanhaecht K., Van Limbergen E., Deweerdt J., Baesens B. A process mining-based investigation of adverse events in care processes. Health Inf. Manag. J. 2014;43(1):16–25. doi: 10.1177/183335831404300103.
- 49. Erdogan T.G., Tarhan A. A goal-driven evaluation method based on process mining for healthcare processes. Appl. Sci. (Switzerland) 2018;8(6).
- 50. Huang H., Jin T., Wang J. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 10219. 2017. Extracting clinical-event-packages from billing data for clinical pathway mining; pp. 19–31. (LNCS).
- 51. Huang Z., Lu X., Duan H. On mining clinical pathway patterns from medical behaviors. Artif. Intell. Med. 2012;56(1):35–50. doi: 10.1016/j.artmed.2012.06.002.
- 52. Mans R., Schonenberg H., Leonardi G., Panzarasa S., Cavallini A., Quaglini S., Van Der Aalst W. Process mining techniques: An application to stroke care. Stud. Health Technol. Inform. 2008;136:573–578.
- 53. Partington A., Wynn M., Suriadi S., Ouyang C., Karnon J. Process mining for clinical processes: A comparative analysis of four australian hospitals. ACM Trans. Manag. Inf. Syst. 2015;5(4).
- 54. Rismanchian F., Lee Y.H. Process mining–based method of designing and optimizing the layouts of emergency departments in hospitals. Health Environ. Res. Des. J. 2017;10(4):105–120. doi: 10.1177/1937586716674471.
- 55. Stefanini A., Aloini D., Dulmin R., Mininno V. HEALTHINF 2016 - 9th International Conference on Health Informatics, Proceedings; Part of 9th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2016. 2016. Linking diagnostic-related groups (DRGs) to their processes by process mining; pp. 438–443.
- 56. Xu X., Jin T., Wang J. 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services, Healthcom 2016. 2016. Summarizing patient daily activities for clinical pathway mining.
- 57. Argiento R., Guglielmi A., Lanzarone E., Nawajah I. A Bayesian framework for describing and predicting the stochastic demand of home care patients. Flex. Serv. Manuf. J. 2016;28(1–2):254–279.
- 58. Fenton N., Neil M. Comparing risks of alternative medical diagnosis using Bayesian arguments. J. Biomed. Inform. 2010;43(4):485–495. doi: 10.1016/j.jbi.2010.02.004.
- 59. Gartner D., Padman R. Improving hospital-wide early resource allocation through machine learning. Stud. Health Technol. Inform. 2015;216:315–319.
- 60. Liu R., Srinivasan R.V., Zolfaghar K., Chin S.-C., Roy S.B., Hasan A., Hazel D. IEEE International Conference on Data Mining Workshops, ICDMW, Vol. 2015-January. 2015. Pathway-finder: An interactive recommender system for supporting personalized care pathways; pp. 1219–1222.
- 61. Alharbi A., Bulpitt A., Johnson O.A. Towards unsupervised detection of process models in healthcare. Stud. Health Technol. Inform. 2018;247:381–385.
- 62. Baker K., Dunwoodie E., Jones R.G., Newsham A., Johnson O., Price C.P., Wolstenholme J., Leal J., McGinley P., Twelves C., Hall G. Process mining routinely collected electronic health records to define real-life clinical pathways during chemotherapy. Int. J. Med. Inform. 2017;103:32–41. doi: 10.1016/j.ijmedinf.2017.03.011.
- 63. McClean S., Garg L., Meenan B., Millard P. 2007. Using Markov models to find interesting patient pathways; pp. 713–718.
- 64. McClean S., Young T., Bustard D., Millard P., Barton M. 2008 4th International IEEE Conference Intelligent Systems, IS 2008, Vol. 1. 2008. Discovery of value streams for lean healthcare; pp. 32–38.
- 65. Du G., Jiang Z., Diao X., Yao Y. Knowledge extraction algorithm for variances handling of CP using integrated hybrid genetic double multi-group cooperative PSO and DPSO. J. Med. Syst. 2012;36(2):979–994. doi: 10.1007/s10916-010-9562-4.
- 66. Huang Z., Lu X., Duan H., Fan W. Summarizing clinical pathways from event logs. J. Biomed. Inform. 2013;46(1):111–127. doi: 10.1016/j.jbi.2012.10.001.
- 67. Kashner T.M., Carmody T.J., Suppes T., Rush A.J., Crismon M.L., Miller A.L., Toprac M., Trivedi M. Catching up on health outcomes: The texas medication algorithm project. Health Serv. Res. 2003;38(1 I):311–331. doi: 10.1111/1475-6773.00117.
- 68. Prodel M., Augusto V., Xie X., Jouaneton B., Lamarsalle L. IEEE International Conference on Automation Science and Engineering, Vol. 2015-October. 2015. Discovery of patient pathways from a national hospital database using process mining and integer linear programming; pp. 1409–1414.
- 69.Huang Z., Bao Y., Dong W., Lu X., Duan H. Online treatment compliance checking for clinical pathways. J. Med. Syst. 2014;38(10) doi: 10.1007/s10916-014-0123-0. cited By 2. [DOI] [PubMed] [Google Scholar]
- 70.Mohammed O., Benlamri R. Developing a semantic web model for medical differential diagnosis recommendation. J. Med. Syst. 2014;38(10) doi: 10.1007/s10916-014-0079-0. cited By 7. [DOI] [PubMed] [Google Scholar]
- 71.Bettencourt-Silva J.H., Mannu G.S., de la Iglesia B. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 9605. 2016. Visualisation of integrated patient-centric data as pathways: Enhancing electronic medical records in clinical practice; pp. 99–124. (LNCS). cited By 0. [Google Scholar]
- 72.Cook D.A., Sorensen K.J., Linderbaum J.A., Pencille L.J., Rhodes D.J. Information needs of generalists and specialists using online best-practice algorithms to answer clinical questions. J. Amer. Med. Inform. Assoc. 2017;24(4):754–761. doi: 10.1093/jamia/ocx002. cited By 0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Happe A., Drezen E. A visual approach of care pathways from the french nationwide snds database – from population to individual records: the ePEPS toolbox. Fundam. Clin. Pharmacol. 2018;32(1):81–84. doi: 10.1111/fcp.12324. cited By 1. [DOI] [PubMed] [Google Scholar]
- 74.Zhang Y., Padman R. An interactive platform to visualize data-driven clinical pathways for the management of multiple chronic conditions. Stud. Health Technol. Inform. 2017;245:672–676. cited By 0. [PubMed] [Google Scholar]
- 75.J. Bowles, M.B. Caminati, S. Cha, An integrated framework for verifying multiple care pathways, Vol. 2018-January, 2018, pp. 1–8, cited By 0.
- 76.Ramos-Merino M., Álvarez Sabucedo L.M., Santos-Gago J.M., Sanz-Valero J. A BPMN based notation for the representation of workflows in hospital protocols. J. Med. Syst. 2018;42(10) doi: 10.1007/s10916-018-1034-2. cited By 0. [DOI] [PubMed] [Google Scholar]
- 77.Yan H., Van Gorp P., Kaymak U., Lu X., Ji L., Chiau C.C., Korsten H.H.M., Duan H. Aligning event logs to task-time matrix clinical pathways in BPMN for variance analysis. IEEE J. Biomed. Health Inf. 2018;22(2):311–317. doi: 10.1109/JBHI.2017.2753827. cited By 0. [DOI] [PubMed] [Google Scholar]
- 78.Bruzzi S., Landa P., Tànfani E., Testi A. Conceptual modelling of the flow of frail elderly through acute-care hospitals: An evidence-based management approach. Manag. Decis. 2018;56(10):2101–2124. cited By 1. [Google Scholar]
- 79.Furuhata H., Araki K., Ogawa T., Ikeda M. Effect on completion of clinical pathway for improving clinical indicator: Cases of hospital stay, mortality rate, and comprehensive-volume ratio. J. Med. Syst. 2017;41(12) doi: 10.1007/s10916-017-0857-6. cited By 0. [DOI] [PubMed] [Google Scholar]
- 80.Han B., Jiang L., Cai H. 2011. Abnormal process instances identification method in healthcare environment; pp. 1387–1392. cited By 4. [Google Scholar]
- 81.Konrad R., Tulu B., Lawley M. Monitoring adherence to evidence-based practices: A method to utilize hl7 messages from hospital information systems. Appl. Clin. Inform. 2013;4(1):126–143. doi: 10.4338/ACI-2012-06-RA-0026. cited By 4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Lin F.-R., Chou S.-C., Pan S.-M., Chen Y.-M. Mining time dependency patterns in clinical pathways. Int. J. Med. Inform. 2001;62(1):11–25. doi: 10.1016/s1386-5056(01)00126-5. cited By 48. [DOI] [PubMed] [Google Scholar]
- 83.Liu J., Huang Z., Lu X., Duan H. 2014. An ontology-based real-time monitoring approach to clinical pathway; pp. 756–761. cited By 0. [Google Scholar]
- 84.Maheshwari K., Cywinski J., Mathur P., Cummings III K.C., Avitsian R., Crone T., Liska D., Campion F.X., Ruetzler K., Kurz A. Identify and monitor clinical variation using machine intelligence: a pilot in colorectal surgery. J. Clin. Monit. Comput. 2019 doi: 10.1007/s10877-018-0200-x. cited By 0. [DOI] [PubMed] [Google Scholar]
- 85.Noro A., Poss J.W., Hirdes J.P., Finne-Soveri H., Ljunggren G., Björnsson J., Schroll M., Jonsson P.V. Method for assigning priority levels in acute care (MAPLe-AC) predicts outcomes of acute hospital care of older persons - a cross-national validation. BMC Med. Inform. Decis. Mak. 2011;11(1) doi: 10.1186/1472-6947-11-39. cited By 0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Wang T., Tian X., Yu M., Qi X., Yang L. Stage division and pattern discovery of complex patient care processes. J. Syst. Sci. Compl. 2017;30(5):1136–1159. cited By 0. [Google Scholar]
- 87.Xu W., Zhu Y., Geng Y. Development of an open metadata schema for clinical pathway (opencp) in China. Methods Inf. Med. 2018;57(4):159–167. doi: 10.3414/ME17-01-0110. cited By 0. [DOI] [PubMed] [Google Scholar]
- 88.Bakker M., Tsui K.-L. Dynamic resource allocation for efficient patient scheduling: A data-driven approach. J. Syst. Sci. Syst. Eng. 2017;26(4):448–462. cited By 0. [Google Scholar]
- 89.Comans T., Raymer M., O’Leary S., Smith D., Scuffham P. Cost-effectiveness of a physiotherapist-led service for orthopaedic outpatients. J. Health Serv. Res. Policy. 2014;19(4):216–223. doi: 10.1177/1355819614533675. cited By 8. [DOI] [PubMed] [Google Scholar]
- 90.Du G., Jiang Z., Diao X., Ye Y., Yao Y. Variances handling method of clinical pathways based on t-s fuzzy neural networks with novel hybrid learning algorithm. J. Med. Syst. 2012;36(3):1283–1300. doi: 10.1007/s10916-010-9589-6. cited By 3. [DOI] [PubMed] [Google Scholar]
- 91.Joranger P., Nesbakken A., Hoff G., Sorbye H., Oshaug A., Aas E. Modeling and validating the cost and clinical pathway of colorectal cancer. Med. Decis. Mak. 2015;35(2):255–265. doi: 10.1177/0272989X14544749. cited By 1. [DOI] [PubMed] [Google Scholar]
- 92.Karnon J., Jones T. A stochastic economic evaluation of letrozole versus tamoxifen as a first-line hormonal therapy: For advanced breast cancer in postmenopausal patients. PharmacoEconomics. 2003;21(7):513–525. doi: 10.2165/00019053-200321070-00006. cited By 23. [DOI] [PubMed] [Google Scholar]
- 93.Rejeb O., Pilet C., Hamana S., Xie X., Durand T., Aloui S., Doly A., Biron P., Perrier L., Augusto V. Performance and cost evaluation of health information systems using micro-costing and discrete-event simulation. Health Care Manag. Sci. 2018;21(2):204–223. doi: 10.1007/s10729-017-9402-x. cited By 1. [DOI] [PubMed] [Google Scholar]
- 94.Chemweno P., Thijs V., Pintelon L., Van Horenbeek A. Discrete event simulation case study: Diagnostic path for stroke patients in a stroke unit. Simul. Model. Pract. Theory. 2014;48:45–57. cited By 6. [Google Scholar]
- 95.Liu Z., Rexachs D., Epelde F., Luque E. An agent-based model for quantitatively analyzing and predicting the complex behavior of emergency departments. J. Comput. Sci. 2017;21:11–23. cited By 0. [Google Scholar]
- 96.Monks T., Pearson M., Pitt M., Stein K., James M.A. Evaluating the impact of a simulation study in emergency stroke care. Oper. Res. Health Care. 2015;6:40–49. cited By 4. [Google Scholar]
- 97.Shukla N., Lahiri S., Ceglarek D. Pathway variation analysis (PVA): Modelling and simulations. Oper. Res. Health Care. 2015;6:61–77. cited By 1. [Google Scholar]
- 98.Uzun Jacobson E., Bayer S., Barlow J., Dennis M., MacLeod M.J. The scope for improvement in hyper-acute stroke care in Scotland. Oper. Res. Health Care. 2015;6:50–60. cited By 1. [Google Scholar]
- 99.Cancer Research UK . 2014. National optimal lung cancer pathway. [Google Scholar]
- 100.Wales Cancer Network . 2019. Single cancer pathway. http://www.walescanet.wales.nhs.uk/single-cancer-pathway. [Google Scholar]
- 101.NHS Wales . 2019. National optimal pathway for lung cancer. [Google Scholar]
- 102.Healthcare Improvement Scotland . 2014. Management of lung cancer. https://www.sign.ac.uk/sign-137-management-of-lung-cancer.html. [Google Scholar]
- 103.Northern Ireland Cancer Network . 2020. Lung pathway. [Google Scholar]
- 104.Irish Cancer Society . 2019. Lung cancer action plan. [Google Scholar]
- 105.Textdistance Textdistance. Python package. 2017 [Google Scholar]
- 106.Levenshtein. V.I. Binary codes capable of correcting deletions, insertions, and reversals. Cybern. Control Theory. 1966;10(8) [Google Scholar]
- 107.Damerau. F.J. A technique for computer detection and correction of spelling errors. Commun. ACM. 1964;7(3):171–176. [Google Scholar]
- 108.Jaro. M.A. Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. J. Amer. Statist. Assoc. 1989;84(406) [Google Scholar]
- 109.Jaro. M.A. Probabilistic linkage of large public health data files. Stat. Med. 1995;14(5–7) doi: 10.1002/sim.4780140510. [DOI] [PubMed] [Google Scholar]
- 110.Winkler. W.E. String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage. Bureau Census. 1990 [Google Scholar]
- 111.Needleman S.B., Wunsch. C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48(3):443–453. doi: 10.1016/0022-2836(70)90057-4. [DOI] [PubMed] [Google Scholar]
- 112.Jaccard P. The distribution of the flora in the alpine zone. The New Phytologist. 1912;XI(2):37–50. [Google Scholar]
- 113.Steinbach M., Tan P.-N., Kumar. V. Pearson; 2005. Introduction to Data Mining, Vol. Chapter 8. [Google Scholar]
- 114.Maier. D. The complexity of some problems on subsequences and supersequences. J. ACM. 1978;25:322–336. [Google Scholar]
- 115.Novikov A. Pyclustering: Data mining library. J. Open Source Softw. 2019;4(36):1230. [Google Scholar]
- 116.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay E. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011;12:2825–2830. [Google Scholar]
- 117.Aspland E. 2020. Github, sim.pro.flow. https://github.com/EmmaAspland/Sim.Pro.Flow. [Google Scholar]












