Using String Metrics to Identify Patient Journeys through Care Pathways

Richard Williams; Iain E Buchan; Mattia Prosperi; John Ainsworth

2014 Nov 14;2014:1208–1217.


PMCID: PMC4419997 PMID: 25954432

Abstract

Given a computerized representation of a care pathway and an electronic record of a patient’s clinical journey, with potential omissions, insertions, discontinuities and reordering, we show that we can accurately match the journey to a particular route through the pathway by converting the problem into a string matching one. We discover that normalized string metrics lead to more unique pathway matches than non-normalized string metrics and should therefore be given preference when using these techniques.

Introduction

When faced with a patient’s electronic health record (EHR) and a prescribed care pathway it is useful to know if that patient’s care has deviated from the expected route through the pathway¹. The degree of deviation from a pathway calculated with a distance metric, when combined with outcome data, could lead to the discovery of instances where the standard of care has been suboptimal leading to adverse outcomes, and also to instances of localized practice that lead to better outcomes.

However, before determining distance from a given route, we need accurately to determine which route through the pathway was traversed by the patient. This is a problem because routinely collected patient information is often poorly recorded with missing data, incorrect coding practice and data recorded out of sequence.

String metrics provide the distance between two strings and are usually based on algorithms for matching strings to patterns, with various degrees of approximation. They typically involve performing operations such as insertion, deletion and substitution. The string metric can be normalized²,³ or non-normalized⁴–⁶.

We attempt to discover the routes patients took through a care pathway by using string matching methods in a novel way with electronic health records from Salford, UK.

Related Work

Representing a care pathway in a format that can be readily interpreted by a computer is essential for analysis and also enables health information systems to provide decision support to health care professionals⁷. Computer-interpretable guidelines (CIGs) are computer representations of the clinical knowledge in a clinical guideline and are usually networks of tasks that occur over time⁸. A recent review of CIGs shows there is ongoing work on CIG modelling languages, their integration with EHRs, validation and verification of CIGs, compliance monitoring and sharing⁹. Most CIG modelling is based on Task-Network Models⁸,⁹ of which our graph-based approach is a general case.

There is also a large body of work on process mining¹⁰,¹¹, frequent pattern mining, and the use of hidden Markov models for trajectory clustering¹² for healthcare data, which has been reviewed by Lakshmanan et al.¹³ However, each of these techniques begins with the healthcare data and attempts to infer the pathways taken, whereas our approach starts with a well-defined care pathway and attempts to discover which route through it was taken.

Background

Care Pathways

Care pathways are structured guidelines for the assessment, diagnosis, and treatment of patients with a given condition¹,¹⁴–¹⁶. They provide the ideal care that a patient should receive and are often represented as a flow chart¹,¹⁴. In the UK, “NICE Pathways” (National Institute for Health and Care Excellence) offers pathways for over 150 conditions¹⁷.

More formally, a care pathway flow chart can be represented as a directed graph, G = (V, E), with V a set of nodes that represent clinical events such as diagnoses, measurements, procedures and treatments, and E a set of directed edges that correspond to the permitted transitions between nodes. A transition can occur in a determined amount of time. Figure 1 shows an example of a care pathway represented as a directed graph, defined a priori by experts.

Figure 1: A graphical model of a simplified, coded care pathway. Clinical codes in parentheses.

SINAP

The Stroke Improvement National Audit Programme (SINAP)¹⁸ is a data collection process for the purposes of clinical audit. It collects data about the care provided to stroke patients and includes several index events and the times they occurred. Here we examine data from Salford Royal Foundation Trust (SRFT) on 1078 patients with suspected strokes between 2010 and 2011. Figure 2 shows the approximate pathways that can be followed when a patient is admitted to hospital with a suspected stroke, covering the events recorded in the SINAP dataset. This is a simple pathway with only two decision points: one after the patient is first seen and one after the patient has undergone brain imaging. The alphanumeric characters associated with each node in the pathway will be used later.

Figure 2: Stroke Improvement National Audit Programme (SINAP) pathway nodes as characters.

Electronic Health Record

A patient’s EHR is typically a list of coded events and states describing their care. In the UK a variety of coding schemes are used, such as Read Codes v2¹⁹, CTV3¹⁹, ICD-10²⁰ and SNOMED²¹. The processes described in this paper can be used with any coding system: here we use the SINAP dataset that employs custom codes.

Method

Process

We first assign an alphanumeric character to each node in the graph. By using the Unicode²² character set we can manage care pathways with up to 65,536 nodes. We then extract every possible route through the pathway as a string made up of the characters assigned to each node. For a graph G with n possible routes we construct the set R = {R₁, R₂, ⋯, R_n}, where each R_i is a string representing one of the n possible routes. For acyclic graphs such as the stroke pathway for the SINAP dataset this is straightforward via recursion. For a directed graph with cycles it is possible to repeat a cycle indefinitely so the number of possible routes is infinite. To avoid this we only allow each cycle to be repeated a finite number of times.
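As a minimal sketch of the route extraction described above, the recursion can be written as follows. The graph, its single-character node labels (P to T) and the function name `enumerate_routes` are hypothetical and chosen only for illustration; they are not the SINAP pathway. The sketch bounds cycles by capping how often each node may be visited, which is one possible way of allowing each cycle only a finite number of repeats and is an assumption rather than the rule used in this work.

```python
from collections import Counter

def enumerate_routes(edges, start, max_visits=1):
    """Return every route from `start` to a terminal node as a string.

    `edges` maps a node character to the list of nodes it can transition to.
    A terminal node is one with no outgoing edges. Each node may appear at
    most `max_visits` times in a route (assumed way of bounding cycles).
    """
    routes = []

    def walk(node, path, visits):
        visits[node] += 1
        path.append(node)
        successors = edges.get(node, [])
        if not successors:
            routes.append("".join(path))  # reached the end of the pathway
        for nxt in successors:
            if visits[nxt] < max_visits:
                walk(nxt, path, visits)
        path.pop()            # backtrack so sibling branches start clean
        visits[node] -= 1

    walk(start, [], Counter())
    return routes

# Hypothetical toy pathway with one cycle (R can loop back to Q).
TOY_EDGES = {"P": ["Q"], "Q": ["R", "S"], "R": ["S", "Q"], "S": ["T"]}

print(enumerate_routes(TOY_EDGES, "P"))                # ['PQRST', 'PQST']
print(enumerate_routes(TOY_EDGES, "P", max_visits=2))  # adds 'PQRQRST' and 'PQRQST'
```

With `max_visits=1` the cycle is never repeated, so the toy graph behaves like an acyclic one; raising the cap adds the routes that go round the loop once.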

Due to the nature of our data, the events recorded are all covered by the pathway. In general, however, when using records from primary or secondary care, they may not be consistent with a care pathway event/transition graph. For a single patient we therefore extract all timed events from their record that occur on the pathway of interest, convert the events to characters, and concatenate the characters into strings according to their date-time order. The strings then represent the patient’s journey through the care pathway.

If our dataset contains patients with multiple interactions with the pathway, we must then distinguish between distinct interactions with the care pathway by specifying a cut-off time. If ever the gap between adjacent patient events is greater than the cut-off, then we assume that the patient has left the pathway and any subsequent events form part of the patient’s next visit to the pathway. This works well when the timescale of a single pathway visit is shorter than the interval between successive visits.
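The conversion of a record into journey strings, including the cut-off rule, might be sketched as below. The event codes, the code-to-character mapping, the 30-day cut-off and the function name `journeys` are all hypothetical placeholders; the SINAP custom codes are not reproduced here.

```python
from datetime import datetime, timedelta

def journeys(events, code_to_char, cutoff):
    """Turn (timestamp, code) events into one string per pathway visit.

    Events whose code is not on the pathway are dropped. A gap longer than
    `cutoff` between adjacent pathway events starts a new visit.
    """
    on_pathway = sorted(
        ((t, code_to_char[c]) for t, c in events if c in code_to_char),
        key=lambda e: e[0],  # date-time order; ties keep recorded order
    )
    visits, current, last_time = [], [], None
    for t, ch in on_pathway:
        if last_time is not None and t - last_time > cutoff:
            visits.append("".join(current))
            current = []
        current.append(ch)
        last_time = t
    if current:
        visits.append("".join(current))
    return visits

# Hypothetical codes and mapping, for illustration only.
CODE_TO_CHAR = {"ARRIVE": "H", "SEEN": "D", "SCAN": "E", "DISCHARGE": "B"}
EVENTS = [
    (datetime(2011, 1, 1, 9, 0), "ARRIVE"),
    (datetime(2011, 1, 1, 9, 15), "SEEN"),
    (datetime(2011, 1, 1, 9, 20), "BLOOD_TEST"),  # not on the pathway
    (datetime(2011, 1, 1, 10, 0), "SCAN"),
    (datetime(2011, 1, 3, 12, 0), "DISCHARGE"),
    (datetime(2011, 6, 1, 8, 0), "ARRIVE"),       # months later: a new visit
    (datetime(2011, 6, 1, 8, 30), "SEEN"),
]

print(journeys(EVENTS, CODE_TO_CHAR, timedelta(days=30)))  # ['HDEB', 'HD']
```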

We then use the following string metrics to determine the distance between a patient pathway and each possible route through a care pathway.

Longest Common Subsequence

Formally, given two sequences A = a₁a₂ ⋯ a_m and B = b₁b₂ ⋯ b_n (m ≤ n) we say that A is a subsequence of B if there are indices 0 < j₁ < j₂ < ⋯ < j_m ≤ n such that $a_{i} = b_{j_{i}}$ is true for i = 1,2, ⋯, m.

Given two sequences X and Y, Z is a common subsequence if it is a subsequence of both X and Y. Z is the longest common subsequence (LCS) if |Z| ≥ |Z’| for all common subsequences Z’, where |X| is the length of X. The LCS is not necessarily unique.

We are interested in which route through the pathway a patient took so we need to decide on a distance metric to convert the LCS into something more meaningful. An initial algorithm for a single patient is as follows:

1. Create a list of all the possible routes R₁, …, R_n through the care pathway
2. Filter the patient’s events to just include pathway events and apply the time cut-off to give an event sequence E = E₁ … E_m
3. For each route R_i calculate L_i = LCS(R_i, E)
4. If L_i > 0 calculate the distance d_i = max (|R_i|, |E|) – |L_i|
5. Return the set of routes with the smallest distance

However, this only considers the discrepancy between the LCS and the longer of the two strings; it doesn’t take into account the length of the LCS itself. We can normalize the distance either by dividing by the length of the LCS, or by dividing by the combined length of the two strings, so that step 4 above becomes either:

4. If L_i > 0 calculate the distance $d_{i} = \frac{\max (| R_{i} |, | E |) - | L_{i} |}{| L_{i} |}$

or

4. If L_i > 0 calculate the distance $d_{i} = \frac{\max (| R_{i} |, | E |) - | L_{i} |}{| R_{i} | + | E |}$

We call these two methods LCS1 and LCS2 respectively.
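The three LCS-based distances can be sketched as follows. The four route strings are the ones named in the Results section; the function names and the dictionary keys are illustrative choices, and the dynamic-programming LCS is a standard textbook formulation rather than necessarily the implementation used here.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distances(route, events):
    """Non-normalized, LCS1 and LCS2 distances; None if nothing in common."""
    common = lcs_length(route, events)
    if common == 0:
        return None
    gap = max(len(route), len(events)) - common
    return {
        "raw": gap,
        "LCS1": gap / common,
        "LCS2": gap / (len(route) + len(events)),
    }

def best_routes(routes, events, method):
    """All routes that share the smallest distance under `method`."""
    scored = {}
    for route in routes:
        distances = lcs_distances(route, events)
        if distances is not None:
            scored[route] = distances[method]
    best = min(scored.values())
    return [route for route, d in scored.items() if d == best]

ROUTES = ["GHDEAFCIB", "GHDEFIB", "GHDFIB", "GHDB"]

for method in ("raw", "LCS1", "LCS2"):
    print(method, best_routes(ROUTES, "GHDFEIB", method))
# raw  ['GHDEFIB', 'GHDFIB']
# LCS1 ['GHDEFIB', 'GHDFIB']
# LCS2 ['GHDEFIB']
```

For the record GHDFEIB, both GHDEFIB and GHDFIB share an LCS of length 6 and a gap of 1, so the non-normalized distance and LCS1 tie. Only LCS2, which also depends on the length of the route, separates them (1/14 against 1/13).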

Simple Edit Distance (Levenshtein Distance)

An alternative to the LCS is to consider the edit distance or Levenshtein distance⁴. The edit distance between two strings X and Y is the minimum number of operations required to convert X into Y where an operation is either: insert a character, delete a character or replace a character. When transposition of two adjacent characters is also allowed (ab → ba), the metric is the Damerau-Levenshtein distance⁵,⁶. The costs of inserting, deleting and replacing are given as W_I, W_D, and W_R respectively. It holds that W_R ≤ W_D + W_I, as we can always delete and then insert instead of substituting. By default the cost of each operation is 1.

The algorithm for our problem would be:

1. Create a list of all the possible routes R₁, …, R_n through the care pathway
2. Filter the patient’s events to just include pathway events and apply the time cut-off to give an event sequence E = E₁ … E_m
3. For each route R_i calculate the distance d_i = LEV(R_i, E)
4. Return the set of routes with the smallest distance

Similarly we can do this for the Damerau-Levenshtein distance, which we will denote as d_i = DAM(R_i, E).
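A minimal sketch of both distances with configurable weights is given below. It uses the restricted "optimal string alignment" form of Damerau-Levenshtein, in which each adjacent pair can be transposed at most once; whether this or the unrestricted form was used here is not stated, so that choice is an assumption, as are the function names and the transposition weight `w_swap`.

```python
def edit_distance(x, y, w_ins=1, w_del=1, w_rep=1, w_swap=None):
    """Weighted cost of converting x into y.

    With w_swap=None this is the Levenshtein distance. Otherwise adjacent
    transpositions are also allowed at cost w_swap (optimal string
    alignment variant of Damerau-Levenshtein).
    """
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del          # delete every character of x
    for j in range(1, n + 1):
        d[0][j] = j * w_ins          # insert every character of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else w_rep
            best = min(
                d[i - 1][j] + w_del,
                d[i][j - 1] + w_ins,
                d[i - 1][j - 1] + cost,
            )
            if (w_swap is not None and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                best = min(best, d[i - 2][j - 2] + w_swap)
            d[i][j] = best
    return d[m][n]

def lev(x, y):
    return edit_distance(x, y)

def dam(x, y):
    return edit_distance(x, y, w_swap=1)

def closest(routes, record, metric):
    """All routes at the smallest distance from the patient record."""
    scores = {route: metric(route, record) for route in routes}
    best = min(scores.values())
    return [route for route, score in scores.items() if score == best]

ROUTES = ["GHDEAFCIB", "GHDEFIB", "GHDFIB", "GHDB"]
print(closest(ROUTES, "GHDFEIB", lev))  # ['GHDFIB']
print(closest(ROUTES, "GHDFEIB", dam))  # ['GHDEFIB', 'GHDFIB']
```

For the record GHDFEIB, LEV returns a single route because it counts the E/F swap as two substitutions, whereas DAM counts it as one transposition and therefore ties with the route that simply lacks E.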

Levenshtein Variants

Several versions of the Levenshtein distance normalized to the length of the strings have been suggested. We denote the following as NLEV².

$NLEV(X, Y) = \frac{LEV(X, Y)}{|X| + |Y|}$

We also consider a normalized Levenshtein distance that satisfies the triangle inequality and is therefore a true distance metric:

$NLD(X, Y) = d_{N\text{-}GLD}(X, Y) = \frac{2 \cdot LEV(X, Y)}{\alpha \cdot (|X| + |Y|) + LEV(X, Y)}$

where α is whichever cost is greater out of insertion and deletion³. However, when α = 1, as is the case when all the weights are set to 1 by default, although the distances produced by NLD and NLEV will differ, the ordering of the matches will always be the same.
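A short derivation, assuming unit weights (α = 1) and non-empty strings, shows why the ordering is preserved. Dividing the numerator and denominator of NLD by |X| + |Y| gives

$$NLD(X, Y) = \frac{2 \cdot LEV(X, Y)}{(|X| + |Y|) + LEV(X, Y)} = \frac{2 \cdot NLEV(X, Y)}{1 + NLEV(X, Y)}$$

so NLD is the function $f(t) = \frac{2t}{1 + t}$ applied to NLEV. Since

$$f'(t) = \frac{2}{(1 + t)^{2}} > 0 \quad \text{for } t \geq 0,$$

f is strictly increasing, and any two routes are ranked the same way by NLD as by NLEV, ties included.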

Finally, we consider a normalized version of the Damerau Levenshtein distance.

$NDAM(X, Y) = \frac{DAM(X, Y)}{|X| + |Y|}$

We compare and contrast the different distance measures: LCS1, LCS2, LEV, DAM, NLEV, NLD and NDAM.

Data cleaning

Right censoring of the data is unlikely as once in hospital all end points are recorded. Most times in the data seem to be rounded to the nearest 10 or 15 minutes. This may potentially result in events appearing simultaneously or even out of order. There is also a risk of recollection or estimation bias as the data is often captured after the event.

When events occur at the same time there are several options available. The patient can be ignored, but this would result in a lot of data being excluded from the analysis. An alternative would be to perform the analysis on the data ordered randomly and let the string matching methods correct any discrepancies. However as we are interested in discovering the actual path the patient took, we can assume where possible the events occurred in the correct order.

For two events A and B on a pathway there is either: a one-way path from A to B, a one-way path from B to A, a path from A to B and B to A, or it is impossible to get from one to the other. For a group of events occurring at the same time if it is possible to order them in a unique way then we choose that as the order of the events. If it is not possible, because of a cycle or an unreachable node, then we discard that patient. For datasets where this is commonplace it may be better to include the patients discarded here and randomise the order of the contemporaneous events. Alternatively we could just discard the events rather than the patient.
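One way to implement this ordering rule is sketched below, under the assumption that "a unique order" means every pair of simultaneous events is connected by a one-way path in exactly one direction. The toy graph, its node labels and the function names are hypothetical and are not the SINAP pathway.

```python
from itertools import combinations

def reachable(edges, start):
    """Set of nodes that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def order_simultaneous(edges, group):
    """Order events recorded at the same time, or return None if ambiguous.

    A pair is ambiguous when each event can reach the other (a cycle) or
    neither can reach the other (unreachable node).
    """
    members = set(group)
    reach = {n: reachable(edges, n) for n in group}
    for a, b in combinations(group, 2):
        if (b in reach[a]) == (a in reach[b]):
            return None
    # Earlier events can reach more of the other group members.
    return sorted(group, key=lambda n: len((reach[n] & members) - {n}), reverse=True)

# Hypothetical toy pathway: P -> Q, then Q branches to R, S or U.
TOY_EDGES = {"P": ["Q"], "Q": ["R", "S", "U"], "R": ["S"], "S": ["T"], "U": ["T"]}

print(order_simultaneous(TOY_EDGES, ["S", "Q", "R"]))  # ['Q', 'R', 'S']
print(order_simultaneous(TOY_EDGES, ["R", "U"]))       # None (neither reaches the other)
```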

Similarly, events of unknown time, or those with just a date and not a time, can be inserted at the correct point of a patient record, if possible, or discarded if contradictions arise.

Data Management and Analysis Environment

The SINAP dataset was transferred to us via an encrypted external hard drive in CSV format. This was then uploaded to a Microsoft SQL Server 2008 database for analysis. Sequence matching was performed with C#.NET and all statistical analysis was done using R²³. The sm library²⁴ was used for plotting density curves and the pROC²⁵ package was used for comparing Receiver Operating Characteristic (ROC) curves.

Results

Data Characteristics

The SINAP dataset contains 1078 patients of which 549 are female and 529 are male.

Table 1 shows the number of records that were cleaned using the above data cleaning process. Only 1 patient’s route could not be uniquely re-ordered.

Table 1.

Data cleaning results

Total patients	1078
Midnight events – able to insert	424
Simultaneous events – able to order	3
Midnight and simultaneous events – able to order	648
No midnight or simultaneous events – no need to order	2
Midnight events – unable to insert	1

There are 46 distinct pathways taken by the 1077 patients following time reordering. Table 2 shows the frequency of the top 10 patient pathways. The pathways that match the integrated care pathway (ICP) are in bold. The route GHDB should be valid; however, no patients in our cohort followed it, suggesting that it is not a route taken in practice and that the care pathway could be altered.

Table 2.

Top 10 pathways – character sequences from Figure 2.

Patient Record	Count	Comments
GHDEFIB	275 (26%)	Valid route
GHDFIB	275 (26%)	Valid route
GHDFEIB	122 (11%)	Valid route with E/F switched – lots of people so may be a valid route.
GDHEFIB	63 (6%)	Valid route with D/H switched – can’t be seen before you arrive.
GDHFIB	60 (6%)	Valid route with D/H switched – as above.
GHDEAFCIB	56 (5%)	Valid route
GHEDFIB	39 (4%)	Valid route with E/D switched – can’t be imaged before first seen.
GHDEFACIB	37 (3%)	Valid with A/F switched.
GHFDIB	24 (2%)	D/F switched – can’t arrive in specialist bed before being seen.
GDHFEIB	24 (2%)	D/H and E/F switched

It appears that there are some valid routes that aren’t in our pathway. For those who don’t get thrombolysed there are many people who arrive in a specialist stroke bed prior to their brain scan. Also there are many people who get “First Seen” before they arrive at the hospital. This seems nonsensical but could be valid if “First Seen” applied to GPs or ambulance staff. Finally there are patients who receive thrombolysis after getting to a specialist stroke bed which could also be a valid route. All other switches appear to be mistakes – for example having a brain scan prior to being first seen.

In order to determine how well each method works we must determine for each patient the most probable route taken. As our dataset is small we can do this manually by defining rules based on the data. We first assume that events that don’t happen are rarely inserted and then classify the patients according to the following rules:

1. If a patient has thrombolysis or a follow up scan then we assume route GHDEAFCIB
2. Of those remaining, for any with a brain scan we assume route GHDEFIB
3. Of those remaining, for any with a stroke unit arrival or discharge we assume route GHDFIB
4. Of those remaining we assume GHDB
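The rules above can be sketched as a small function. The mapping of characters to events used here (A thrombolysis, C follow up scan, E brain scan, F stroke unit arrival, I stroke unit discharge) is an assumption inferred from the comments in Table 2, since Figure 2 is not reproduced in this text; the function name is also illustrative.

```python
def assumed_route(record):
    """Most probable route for a patient record, following the manual rules.

    Character meanings are assumed from the Table 2 comments:
    A = thrombolysis, C = follow up scan, E = brain scan,
    F = stroke unit arrival, I = stroke unit discharge.
    """
    if "A" in record or "C" in record:
        return "GHDEAFCIB"
    if "E" in record:
        return "GHDEFIB"
    if "F" in record or "I" in record:
        return "GHDFIB"
    return "GHDB"

print(assumed_route("GHDFEIB"))    # GHDEFIB
print(assumed_route("GHDEFACIB"))  # GHDEAFCIB
```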

In addition to returning the correct result it is also of use if the distance measure returns a unique result. There will be situations where this isn’t possible but in general string matching methods that return more unique results are preferable.

For each method, Table 3 gives the number of unique matches and the number of correct matches where a correct match is one that is both unique and matches with the routes we assume the patients actually followed.

Table 3.

Number of unique and correct matches

Method	Unique matches (% of patients)	Correct matches (% of unique matches)	Correct (% of patients)
LEV	818 (75.95%)	645 (78.85%)	59.89%
DAM	853 (79.20%)	849 (99.53%)	78.83%
LCS1	882 (81.89%)	878 (99.55%)	81.52%
LCS2	1077 (100.00%)	1070 (99.35%)	99.35%
NLEV	1076 (99.91%)	841 (78.16%)	78.09%
NLD	1076 (99.91%)	841 (78.16%)	78.09%
NDAM	1076 (99.91%)	1068 (99.26%)	99.16%

The NLEV and NLD methods produce the same results as predicted. The ratio of correct matches to unique matches shows that the Damerau-Levenshtein and the longest common subsequence methods work excellently with >99% correct, whereas the Levenshtein variants only achieve 78–79%. It can also be seen that normalized methods are better at producing unique matches with LCS2 matching all pathways uniquely, while NLEV, NLD and NDAM only fail to give a unique answer for a single patient – actually a different patient for each method. Examining the difference between NLEV and NDAM shows that NDAM is correctly identifying pathways where events have been recorded out of sequence. As an example the patient record of GHDFEIB is correctly matched to GHDEFIB by NDAM, while NLEV matches it to GHDFIB.
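The GHDFEIB example can be checked by hand with unit weights. Under the Levenshtein distance the swapped E and F cost two substitutions against GHDEFIB, while GHDFIB needs only one insertion:

$$NLEV(\text{GHDEFIB}, \text{GHDFEIB}) = \frac{2}{7 + 7} \approx 0.143, \qquad NLEV(\text{GHDFIB}, \text{GHDFEIB}) = \frac{1}{6 + 7} \approx 0.077$$

so NLEV selects GHDFIB. Under the Damerau-Levenshtein distance the swap is a single transposition, so both unnormalized distances equal 1 and only the normalization separates them:

$$NDAM(\text{GHDEFIB}, \text{GHDFEIB}) = \frac{1}{7 + 7} \approx 0.071, \qquad NDAM(\text{GHDFIB}, \text{GHDFEIB}) = \frac{1}{6 + 7} \approx 0.077$$

so NDAM selects GHDEFIB.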

When the values for unique correct matches are combined the normalized Damerau-Levenshtein and the second Longest Common Subsequence methods are best, correctly matching >99% of the patient pathways.

For these two methods we can split the pathways into two groups: correct and incorrect matches, where a correct match is when the algorithm uniquely identifies the route the patient traversed through the pathway. We then compare the groups under the null hypothesis that the mean ‘string’ distance between them is equal. The density plots in Figure 3 demonstrate that the data we want to contrast are not drawn from normal or symmetrical distributions; indeed, the distributions of string distances are quite different for matches compared with non-matches. Thus we make the contrast with a non-parametric (Mann-Whitney) method²⁶, demonstrating statistically highly significant differences for both NDAM (P < 0.0001) and LCS2 (P < 0.0001) metrics.

Figure 3: Box (top) and density (bottom) plots of string distances for matches (YES) and non-matches (NO) for NDAM and LCS2 metrics.

Finally, we compare NDAM, LCS2 and NLEV string distance metrics with regard to their classification accuracy for our care pathway journeys. Figure 4 shows the ROC curves for each metric with our test dataset, and the 95% confidence intervals for the areas under the curves: the more detailed comparison of the two most accurate metrics (NDAM and LCS2) is the Mann-Whitney result above.

Figure 4: Receiver Operating Characteristic (ROC) curves for NDAM, NLEV and LCS2 string distance metrics with 95% confidence intervals for the areas under the curves.

Discussion

Distance Weighting

The operations in the Damerau-Levenshtein string metric can be weighted. Given the nature of our dataset it is more likely that records were omitted or out of order than miscoded. If we are sure of this we can change the weighting of the operations accordingly, an option that is possible with the NDAM method but not with LCS2. By doubling the weight associated with deleting a character, and thereby making it less likely that matches will feature deletions, we obtain 1077 unique matches for the 1077 patients, of which 1074 are correct. Weighted NDAM then becomes the most accurate way of predicting a patient’s route.

Generalization

The string matching process described here operates on a graph based representation of a care pathway. Therefore the methodology is theoretically applicable, although untested, to any process or workflow that can be represented as a graph, in healthcare and beyond.

Future work

There are several factors not studied in this paper that will affect the overall success of the method. The size and shape of the graph are factors, as is the quality of the data. Further work is needed to determine which graph shapes work well with this method. Finally, the next stage of our work is to determine how the distance a patient is from their care pathway predicts their outcomes.

Conclusion

String matching would seem to be a highly successful way to determine which route a patient followed in a care pathway. Normalized distance functions should be used to ensure high numbers of unique matches. For clinical data where the chance of events occurring, or being recorded, in the wrong order is high, the Damerau-Levenshtein or Longest Common Subsequence methods should be used in preference to the Levenshtein distance.

Acknowledgments

Funded by the National Institute for Health Research Greater Manchester Primary Care Patient Safety Translational Research Centre (NIHR GM PSTRC). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

References

1. Ainsworth J, Buchan I. COCPIT: A Tool for Integrated Care Pathway Variance Analysis. Stud Health Technol Inform. 2012;180:995–9. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22874343.
2. Marzal A, Vidal E. Computation of normalized edit distance and applications. IEEE Trans Pattern Anal Mach Intell. 1993;15. doi: 10.1109/34.232078.
3. Yujian L, Bo L. A normalized Levenshtein distance metric. IEEE Trans Pattern Anal Mach Intell. 2007;29:1091–1095. doi: 10.1109/TPAMI.2007.1078.
4. Levenshtein VI. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov Phys Dokl. 1966;10:707–710. Available at: http://adsabs.harvard.edu/abs/1966SPhD…10.707L.
5. Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964;7:171–176. doi: 10.1145/363958.363994.
6. Oommen BJ, Loke RKS. Pattern recognition of strings with substitutions, insertions, deletions and generalized transpositions. Pattern Recognit. 1997;30(5):789–800. doi: 10.1016/S0031-3203(96)00101-X.
7. Gooch P, Roudsari A. Computerization of workflows, guidelines, and care pathways: a review of implementation challenges for process-oriented health information systems. J Am Med Inform Assoc. 2011;18(6):738–48. doi: 10.1136/amiajnl-2010-000033.
8. Peleg M, Tu S, Bury J, et al. Comparing Computer-interpretable Guideline Models: A Case-study Approach. J Am Med Informatics Assoc. 2003;10(1):52–68. doi: 10.1197/jamia.M1135.
9. Peleg M. Computer-interpretable clinical guidelines: a methodological review. J Biomed Inform. 2013;46(4):744–63. doi: 10.1016/j.jbi.2013.06.009.
10. Huang Z, Dong W, Ji L, Gan C, Lu X, Duan H. Discovery of clinical pathway patterns from event logs using probabilistic topic models. J Biomed Inform. 2014;47:39–57. doi: 10.1016/j.jbi.2013.09.003.
11. Kaymak U, Mans R, Van De Steeg T, Dierks M. On process mining in health care. 2012 IEEE Int Conf Syst Man, Cybern; 2012. pp. 1859–1864.
12. Poelmans J, Dedene G. Combining business process and data discovery techniques for analyzing and improving integrated care pathways. Adv Data …. 2010. Available at: http://link.springer.com/chapter/10.1007/978-3-642-14400-4_39. Accessed July 8, 2014.
13. Lakshmanan G, Rozsnyai S, Wang F. Investigating clinical care pathways correlated with outcomes. Bus Process Manag. 2013. pp. 323–338. Available at: http://link.springer.com/chapter/10.1007/978-3-642-40176-3_27. Accessed July 8, 2014.
14. Schrijvers G, van Hoorn A, Huiskes N. The care pathway: concepts and theories: an introduction. Int J Integr Care. 2012;12(Spec Ed Integrated Care Pathways):e192. doi: 10.5334/ijic.812. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3602959&tool=pmcentrez&rendertype=abstract. Accessed March 4, 2014.
15. Campbell H, Hotchkiss R, Bradshaw N, Porteous M. Integrated care pathways. BMJ. 1998;316(7125):133–7. doi: 10.1136/bmj.316.7125.133. Available at: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2665398&tool=pmcentrez&rendertype=abstract. Accessed March 4, 2014.
16. Riley K. Care pathways. Paving the way. Health Serv J. 1998;108(5597):30–1. Available at: http://www.ncbi.nlm.nih.gov/pubmed/10177611. Accessed March 4, 2014.
17. NICE Pathways. Available at: http://pathways.nice.org.uk/. Accessed March 4, 2014.
18. SINAP (Stroke Improvement National Audit Programme) | Royal College of Physicians. Available at: http://www.rcplondon.ac.uk/projects/stroke-improvement-national-audit-programme-sinap. Accessed March 4, 2014.
19. NHS Connecting for Health. NHS Connecting for Health – Read Codes. {NHS} Connect Heal. 2013. Available at: http://www.connectingforhealth.nhs.uk/systemsandservices/data/uktc/readcodes.
20. ICD-10 Classification – NHS Connecting for Health. Available at: http://www.connectingforhealth.nhs.uk/systemsandservices/data/clinicalcoding/codingstandards/icd10. Accessed March 8, 2014.
21. Release I, International T, Terminology H, Development S. SNOMED Clinical Terms Technical Reference Guide. Development. 2008. p. 164. Available at: http://htg.his.uvic.ca/index.php?ContentFileId=57.
22. Needleman M. The Unicode Standard. Ser Rev. 2000;26:51–54. doi: 10.1016/S0098-7913(00)00059-9.
23. R Core Team. R: A Language and Environment for Statistical Computing. 2013. Available at: http://www.r-project.org/
24. Bowman AW, Azzalini A. Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-plus Illustrations. 1997:982. doi: 10.2307/2670015.
25. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. doi: 10.1186/1471-2105-12-77.
26. With S. Wilcoxon – Mann – Whitney. Stat Surv. 1945;4:1–3. doi: 10.1214/09-SS051.
