AMIA Annual Symposium Proceedings. 2014 Nov 14;2014:1208–1217.

Using String Metrics to Identify Patient Journeys through Care Pathways

Richard Williams 1,2, Iain E Buchan 1,2, Mattia Prosperi 1, John Ainsworth 1,2
PMCID: PMC4419997  PMID: 25954432

Abstract

Given a computerized representation of a care pathway and an electronic record of a patient’s clinical journey, with potential omissions, insertions, discontinuities and reordering, we show that we can accurately match the journey to a particular route through the pathway by converting the problem into a string matching one. We discover that normalized string metrics lead to more unique pathway matches than non-normalized string metrics and should therefore be given preference when using these techniques.

Introduction

When faced with a patient’s electronic health record (EHR) and a prescribed care pathway it is useful to know if that patient’s care has deviated from the expected route through the pathway1. The degree of deviation from a pathway calculated with a distance metric, when combined with outcome data, could lead to the discovery of instances where the standard of care has been suboptimal leading to adverse outcomes, and also to instances of localized practice that lead to better outcomes.

However, before determining distance from a given route, we need to determine accurately which route through the pathway was traversed by the patient. This is a problem because routinely collected patient information is often poorly recorded, with missing data, incorrect coding practice and data recorded out of sequence.

String metrics provide the distance between two strings and are usually based on algorithms for matching strings to patterns, with various degrees of approximation. They typically involve performing operations such as insertion, deletion and substitution. The string metric can be normalized2,3 or non-normalized4–6.

We attempt to discover the routes patients took through a care pathway by using string matching methods in a novel way with electronic health records from Salford, UK.

Related Work

Representing a care pathway in a format that can be readily interpreted by a computer is essential for analysis and also enables health information systems to provide decision support to health care professionals7. Computer-interpretable guidelines (CIGs) are computer representations of the clinical knowledge in a clinical guideline and are usually networks of tasks that occur over time8. A recent review of CIGs shows there is ongoing work on CIG modelling languages, their integration with EHRs, validation and verification of CIGs, compliance monitoring and sharing9. Most CIG modelling is based on Task-Network Models8,9 of which our graph-based approach is a general case.

There is also a large body of work on process mining10,11, frequent pattern mining, and the use of hidden Markov models for trajectory clustering12 for healthcare data, which has been reviewed by Lakshmanan et al.13 However, each of these techniques begins with the healthcare data and attempts to interpolate the pathways taken, whereas our approach starts with a well-defined care pathway and attempts to discover the route taken.

Background

Care Pathways

Care pathways are structured guidelines for the assessment, diagnosis, and treatment of patients with a given condition1,14–16. They provide the ideal care that a patient should receive and are often represented as a flow chart1,14. In the UK, “NICE Pathways” (National Institute for Health and Care Excellence) offers pathways for over 150 conditions17.

More formally, a care pathway flow chart can be represented as a directed graph, G = (V, E), with V a set of nodes that represent clinical events such as diagnoses, measurements, procedures and treatments, and E a set of directed edges that correspond to the permitted transitions between nodes. A transition can occur in a determined amount of time. Figure 1 shows an example of a care pathway represented as a directed graph, defined a priori by experts.

Figure 1:

A graphical model of a simplified, coded care pathway. Clinical codes in parentheses.

SINAP

The Stroke Improvement National Audit Programme (SINAP)18 is a data collection process for the purposes of clinical audit. It collects data about the care provided to stroke patients and includes several index events and the times they occurred. Here we examine data from Salford Royal Foundation Trust (SRFT) on 1078 patients with suspected strokes between 2010 and 2011. Figure 2 shows the approximate pathways that can be followed when a patient is admitted to hospital with a suspected stroke, covering the events recorded in the SINAP dataset. This is a simple pathway with only two decision points: one after the patient is first seen and one after the patient has undergone brain imaging. The alphanumeric characters associated with each node in the pathway will be used later.

Figure 2:

Stroke Improvement National Audit Programme (SINAP) pathway nodes as characters.

Electronic Health Record

A patient’s EHR is typically a list of coded events and states describing their care. In the UK a variety of coding schemes are used, such as Read Codes v219, CTV319, ICD-1020 and SNOMED21. The processes described in this paper can be used with any coding system: here we use the SINAP dataset that employs custom codes.

Method

Process

We first assign an alphanumeric character to each node in the graph. By using the Unicode22 character set we can manage care pathways with up to 65,536 nodes. We then extract every possible route through the pathway as a string made up of the characters assigned to each node. For a graph G with n possible routes we construct the set R = {R1, R2, ⋯, Rn}, where each Ri is a string representing one of the n possible routes. For acyclic graphs such as the stroke pathway for the SINAP dataset this is straightforward via recursion. For a directed graph with cycles it is possible to repeat a cycle indefinitely so the number of possible routes is infinite. To avoid this we only allow each cycle to be repeated a finite number of times.
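The recursive route extraction for an acyclic pathway can be sketched as follows. This is a minimal illustration, not the authors' C# implementation: the function name `enumerate_routes`, the adjacency-dictionary representation and the toy graph are all assumptions made for the example, and the sketch does not handle cycles.

```python
from typing import Dict, List

def enumerate_routes(edges: Dict[str, List[str]], start: str) -> List[str]:
    """Return every route from `start` to a terminal node as a string of node characters.

    Assumes the graph is acyclic; on a cyclic graph this recursion would not terminate.
    """
    successors = edges.get(start, [])
    if not successors:
        # Terminal node: the only route from here is the node itself.
        return [start]
    routes = []
    for nxt in successors:
        for tail in enumerate_routes(edges, nxt):
            routes.append(start + tail)
    return routes

# Hypothetical toy pathway (not the SINAP graph): a -> b, then c or d, then e.
toy_edges = {"a": ["b"], "b": ["c", "d"], "c": ["e"], "d": ["e"], "e": []}
routes = enumerate_routes(toy_edges, "a")
print(routes)  # ['abce', 'abde']
```

Each returned string plays the role of one Ri in the set R.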

Due to the nature of our data, the events recorded are all covered by the pathway. In general, however, when using records from primary or secondary care, they may not be consistent with a care pathway event/transition graph. For a single patient we therefore extract all timed events from their record that occur on the pathway of interest, convert the events to characters, and concatenate the characters into strings according to their date-time order. The strings then represent the patient’s journey through the care pathway.

If our dataset contains patients with multiple interactions with the pathway, we must then distinguish between distinct interactions with the care pathway by specifying a cut-off time. If the gap between adjacent patient events is ever greater than the cut-off, then we assume that the patient has left the pathway and any subsequent events form part of the patient’s next visit to the pathway. This works well when the timescale of a pathway is shorter than the gaps between successive visits to it.
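A sketch of how a record might be turned into journey strings is given here. The event codes ("ARRIVE", "SEEN", and so on), the `code_to_char` mapping and the 30-day cut-off are invented for illustration; the source does not specify them. Events with identical timestamps are sorted arbitrarily in this sketch.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

def build_journeys(events: List[Tuple[datetime, str]],
                   code_to_char: Dict[str, str],
                   cutoff: timedelta) -> List[str]:
    """Convert one patient's timed events into one string per pathway visit."""
    # Keep only events that appear on the pathway, then order them by time.
    on_pathway = sorted((when, code_to_char[code])
                        for when, code in events if code in code_to_char)
    journeys: List[str] = []
    previous = None
    for when, char in on_pathway:
        # A gap longer than the cut-off starts a new visit to the pathway.
        if previous is None or when - previous > cutoff:
            journeys.append("")
        journeys[-1] += char
        previous = when
    return journeys

# Hypothetical codes and character assignments.
code_to_char = {"ARRIVE": "H", "SEEN": "D", "DISCHARGE": "B"}
events = [
    (datetime(2010, 1, 3, 12, 0), "DISCHARGE"),
    (datetime(2010, 1, 1, 9, 0), "ARRIVE"),
    (datetime(2010, 1, 1, 10, 0), "BLOOD_TEST"),  # not on the pathway, so dropped
    (datetime(2010, 1, 1, 9, 30), "SEEN"),
    (datetime(2010, 6, 1, 8, 0), "ARRIVE"),
    (datetime(2010, 6, 1, 8, 20), "SEEN"),
]
print(build_journeys(events, code_to_char, timedelta(days=30)))  # ['HDB', 'HD']
```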

We then use the following string metrics to determine the distance between a patient pathway and each possible route through a care pathway.

Longest Common Subsequence

Formally, given two sequences $A = a_1 a_2 \cdots a_m$ and $B = b_1 b_2 \cdots b_n$ ($m \le n$), we say that $A$ is a subsequence of $B$ if there are indices $0 < j_1 < j_2 < \cdots < j_m \le n$ such that $a_i = b_{j_i}$ for $i = 1, 2, \ldots, m$.

Given two sequences X and Y, Z is a common subsequence if it is a subsequence of both X and Y. Z is the longest common subsequence (LCS) if |Z| >= |Z’| for all common subsequences Z’, where |X| is the length of X. The LCS is not necessarily unique.

We are interested in which route through the pathway a patient took so we need to decide on a distance metric to convert the LCS into something more meaningful. An initial algorithm for a single patient is as follows:

  1. Create a list of all the possible routes R1, …, Rn through the care pathway

  2. Filter the patient’s events to just include pathway events and apply the time cut-off to give an event sequence E = E1 ⋯ Em

  3. For each route Ri calculate Li = LCS(Ri, E)

  4. If Li > 0 calculate the distance di = max (|Ri|, |E|) – |Li|

  5. Return the set of routes with the smallest distance

However, this only considers the discrepancy between the LCS and the pathway route; it doesn’t take into account the length of the LCS. We can normalize the distance either by dividing by the length of the LCS, or by dividing by the combined length of the two strings, and step 4 above becomes either:

  4. If Li > 0 calculate the distance $d_i = \frac{\max(|R_i|, |E|) - |L_i|}{|L_i|}$

    or

  4. If Li > 0 calculate the distance $d_i = \frac{\max(|R_i|, |E|) - |L_i|}{|R_i| + |E|}$

We call these two methods LCS1 and LCS2 respectively.
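The three LCS-based distances can be sketched as follows. The function names are illustrative, and the route list is an assumption: it contains the four routes named in the classification rules in the Results section, taken here to be the full set of routes through the SINAP pathway.

```python
from typing import Dict, List, Optional

def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distances(route: str, journey: str) -> Optional[Dict[str, float]]:
    """Non-normalized, LCS1 and LCS2 distances; None when nothing is shared."""
    shared = lcs_length(route, journey)
    if shared == 0:
        return None
    gap = max(len(route), len(journey)) - shared
    return {"raw": gap,
            "LCS1": gap / shared,
            "LCS2": gap / (len(route) + len(journey))}

def best_routes(routes: List[str], journey: str, method: str) -> List[str]:
    """All routes that share the smallest distance under the chosen method."""
    scored = {}
    for route in routes:
        distances = lcs_distances(route, journey)
        if distances is not None:
            scored[route] = distances[method]
    if not scored:
        return []
    smallest = min(scored.values())
    return sorted(r for r, value in scored.items() if value == smallest)

# Assumed route set, taken from the classification rules in the Results section.
ROUTES = ["GHDEAFCIB", "GHDEFIB", "GHDFIB", "GHDB"]
for method in ("raw", "LCS1", "LCS2"):
    print(method, best_routes(ROUTES, "GHDFEIB", method))
```

For the journey GHDFEIB, both GHDEFIB and GHDFIB share an LCS of length 6 and a gap of 1, so the non-normalized distance and LCS1 tie. Only LCS2, which divides by the combined string length (1/14 against 1/13), separates them.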

Simple Edit Distance (Levenshtein Distance)

An alternative to the LCS is to consider the edit distance or Levenshtein distance4. The edit distance between two strings X and Y is the minimum number of operations required to convert X into Y where an operation is either: insert a character, delete a character or replace a character. When switching adjacent characters is allowed (ab → ba) the metric is the Damerau-Levenshtein distance5,6. The costs of inserting, deleting and replacing are given as WI, WD, and WR respectively. It holds that WR ≤ WD + WI, as we can always delete and then insert instead of substituting. By default the cost of each operation is 1.

The algorithm for our problem would be:

  1. Create a list of all the possible routes R1, …, Rn through the care pathway

  2. Filter the patient’s events to just include pathway events and apply the time cut-off to give an event sequence E = E1 ⋯ Em

  3. For each route Ri calculate the distance di = LEV(Ri, E)

  4. Return the set of routes with the smallest distance

Similarly we can do this for the Damerau-Levenshtein distance which we will notate as di = DAM(Ri, E).
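A sketch of both edit distances with configurable operation costs follows. Two points are assumptions rather than details from the source: the transposition step uses the restricted (optimal string alignment) form of Damerau-Levenshtein, and the route list is the set of four routes named in the classification rules in the Results section. The names `edit_distance`, `closest_routes` and `w_swap` are illustrative.

```python
from typing import List, Optional

def edit_distance(x: str, y: str, w_ins: float = 1, w_del: float = 1,
                  w_rep: float = 1, w_swap: Optional[float] = None) -> float:
    """Cheapest cost of converting x into y.

    With w_swap=None this is the Levenshtein distance (LEV). With w_swap set,
    swapping two adjacent characters is also allowed (DAM, restricted variant).
    """
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * w_del          # delete every character of x
    for j in range(1, cols):
        d[0][j] = j * w_ins          # insert every character of y
    for i in range(1, rows):
        for j in range(1, cols):
            same = x[i - 1] == y[j - 1]
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + (0 if same else w_rep))
            if (w_swap is not None and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + w_swap)
    return d[-1][-1]

def closest_routes(routes: List[str], journey: str, **weights) -> List[str]:
    """All routes that share the smallest edit distance to the journey."""
    scores = {route: edit_distance(route, journey, **weights) for route in routes}
    best = min(scores.values())
    return sorted(route for route, score in scores.items() if score == best)

# Assumed route set, taken from the classification rules in the Results section.
ROUTES = ["GHDEAFCIB", "GHDEFIB", "GHDFIB", "GHDB"]
print(closest_routes(ROUTES, "GHDFEIB"))            # LEV
print(closest_routes(ROUTES, "GHDFEIB", w_swap=1))  # DAM
```

For the journey GHDFEIB, LEV sees a single edit to GHDFIB but two substitutions to GHDEFIB, so it picks GHDFIB. DAM treats the E/F switch as one operation, which leaves GHDEFIB and GHDFIB tied at distance 1.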

Levenshtein Variants

Several versions of the Levenshtein Distance normalized to the length of the strings have been suggested. We notate the following as NLEV2.

$$\mathrm{NLEV}(X,Y) = \frac{\mathrm{LEV}(X,Y)}{|X| + |Y|}$$

We also consider a normalized Levenshtein distance that satisfies the triangle inequality and is therefore a true distance metric:

$$\mathrm{NLD}(X,Y) = d_{\mathrm{N\text{-}GLD}}(X,Y) = \frac{2\,\mathrm{LEV}(X,Y)}{\alpha\,(|X| + |Y|) + \mathrm{LEV}(X,Y)}$$

where α is whichever cost is greater out of insertion and deletion3. However, when α = 1, as is the case when all the weights are set to 1 by default, although the distances produced by NLD and NLEV will differ, the ordering of the matches will always be the same.
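The claim about identical ordering can be checked with a short derivation. This is a sketch using the definitions above, with the shorthand $L = \mathrm{LEV}(X,Y)$, $S = |X| + |Y|$ and $t = \mathrm{NLEV}(X,Y) = L/S$ introduced here for convenience.

```latex
% With \alpha = 1, divide numerator and denominator of NLD by S:
\mathrm{NLD}(X,Y) = \frac{2L}{S + L} = \frac{2\,(L/S)}{1 + L/S} = \frac{2t}{1 + t}

% The map f(t) = 2t/(1+t) is strictly increasing for t \ge 0:
f'(t) = \frac{2}{(1 + t)^{2}} > 0
```

Because NLD is a strictly increasing function of NLEV, any two routes are ranked in the same order by both metrics, and ties under one are ties under the other.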

Finally, we consider a normalized version of the Damerau Levenshtein distance.

$$\mathrm{NDAM}(X,Y) = \frac{\mathrm{DAM}(X,Y)}{|X| + |Y|}$$

We compare and contrast the different distance measures: LCS1, LCS2, LEV, DAM, NLEV, NLD and NDAM.
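The three normalized edit distances can be compared on a single journey with the sketch below. It assumes unit costs (so α = 1), the restricted (optimal string alignment) form of Damerau-Levenshtein, non-empty strings, and the four routes named in the classification rules in the Results section. All function names are illustrative.

```python
from typing import Callable, List

def distance(x: str, y: str, allow_swaps: bool) -> int:
    """Unit-cost edit distance; adjacent transpositions cost 1 when allowed."""
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i
    for j in range(len(y) + 1):
        d[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if (allow_swaps and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def nlev(x: str, y: str) -> float:
    return distance(x, y, False) / (len(x) + len(y))

def nld(x: str, y: str) -> float:
    lev = distance(x, y, False)
    return 2 * lev / (len(x) + len(y) + lev)  # alpha = 1 under unit costs

def ndam(x: str, y: str) -> float:
    return distance(x, y, True) / (len(x) + len(y))

def ranking(routes: List[str], journey: str,
            metric: Callable[[str, str], float]) -> List[str]:
    """Routes ordered from closest to farthest under the given metric."""
    return sorted(routes, key=lambda route: metric(route, journey))

# Assumed route set, taken from the classification rules in the Results section.
ROUTES = ["GHDEAFCIB", "GHDEFIB", "GHDFIB", "GHDB"]
for metric in (nlev, nld, ndam):
    print(metric.__name__, ranking(ROUTES, "GHDFEIB", metric))
```

On this journey NLEV and NLD give the same ranking with GHDFIB first, while NDAM puts GHDEFIB first because the E/F switch counts as one operation over a longer combined length (1/14 against 1/13).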

Data cleaning

Right censoring of the data is unlikely as once in hospital all end points are recorded. Most times in the data seem to be rounded to the nearest 10 or 15 minutes. This may potentially result in events appearing simultaneously or even out of order. There is also a risk of recollection or estimation bias as the data is often captured after the event.

When events occur at the same time there are several options available. The patient can be ignored, but this would result in a lot of data being excluded from the analysis. An alternative would be to perform the analysis on the data ordered randomly and let the string matching methods correct any discrepancies. However as we are interested in discovering the actual path the patient took, we can assume where possible the events occurred in the correct order.

For two events A and B on a pathway there is either: a one-way path from A to B, a one-way path from B to A, a path from A to B and B to A, or it is impossible to get from one to the other. For a group of events occurring at the same time, if it is possible to order them in a unique way then we choose that as the order of the events. If it is not possible, because of a cycle or an unreachable node, then we discard that patient. For datasets where this is commonplace it may be better to include the patients discarded here and randomise the order of the contemporaneous events. Alternatively we could just discard the events rather than the patient.
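The ordering rule for simultaneous events can be sketched with a reachability check. The graph, the function names and the assumption that tied events are distinct are illustrative; the source does not describe how the rule was implemented.

```python
from typing import Dict, List, Optional, Set

def reachable(edges: Dict[str, List[str]], source: str) -> Set[str]:
    """All nodes that can be reached from `source` by following directed edges."""
    seen: Set[str] = set()
    stack = [source]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def order_simultaneous(edges: Dict[str, List[str]],
                       tied: List[str]) -> Optional[List[str]]:
    """Return the unique pathway order of the tied events, or None if there is none."""
    reach = {event: reachable(edges, event) for event in tied}
    for i, a in enumerate(tied):
        for b in tied[i + 1:]:
            forward, backward = b in reach[a], a in reach[b]
            if forward == backward:
                # Both directions (a cycle) or neither (unreachable): no unique order.
                return None
    # Every pair is ordered one way, so the event that reaches the most
    # other tied events must come first.
    return sorted(tied, key=lambda e: -sum(1 for o in tied if o != e and o in reach[e]))

# Hypothetical toy pathway (not the SINAP graph).
toy_edges = {"a": ["b"], "b": ["c", "d"], "c": ["e"], "d": ["e"]}
print(order_simultaneous(toy_edges, ["e", "b", "c"]))  # ['b', 'c', 'e']
print(order_simultaneous(toy_edges, ["c", "d"]))       # None
```

A return value of None corresponds to the case where the patient, or the tied events, would be discarded or randomly ordered.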

Similarly, events of unknown time, or those with just a date and not a time, can be inserted at the correct point of a patient record, if possible, or discarded if contradictions arise.

Data Management and Analysis Environment

The SINAP dataset was transferred to us via an encrypted external hard drive in CSV format. This was then uploaded to a Microsoft SQL Server 2008 database for analysis. Sequence matching was performed with C#.NET and all statistical analysis was done using R23. The sm library24 was used for plotting density curves and the pROC25 package was used for comparing Receiver Operating Characteristic (ROC) curves.

Results

Data Characteristics

The SINAP dataset contains 1078 patients of which 549 are female and 529 are male.

Table 1 shows the number of records that were cleaned using the above data cleaning process. Only 1 patient’s route could not be uniquely re-ordered.

Table 1.

Data cleaning results

| Category | Patients |
| --- | --- |
| Total patients | 1078 |
| Midnight events – able to insert | 424 |
| Simultaneous events – able to order | 3 |
| Midnight and simultaneous events – able to order | 648 |
| No midnight or simultaneous events – no need to order | 2 |
| Midnight events – unable to insert | 1 |

There are 46 distinct pathways taken by the 1077 patients following time reordering. Table 2 shows the frequency of the top 10 patient pathways. The pathways that match the ICP are those commented simply as “Valid route”. GHDB should be a valid route; however, no patients in our cohort followed it, suggesting that it is not a valid route and that the care pathway could be altered.

Table 2.

Top 10 pathways – character sequences from figure 2.

| Patient record | Count (%) | Comments |
| --- | --- | --- |
| GHDEFIB | 275 (26%) | Valid route |
| GHDFIB | 275 (26%) | Valid route |
| GHDFEIB | 122 (11%) | Valid route with E/F switched – lots of people so may be a valid route. |
| GDHEFIB | 63 (6%) | Valid route with D/H switched – can’t be seen before you arrive. |
| GDHFIB | 60 (6%) | Valid route with D/H switched – as above. |
| GHDEAFCIB | 56 (5%) | Valid route |
| GHEDFIB | 39 (4%) | Valid route with E/D switched – can’t be imaged before first seen. |
| GHDEFACIB | 37 (3%) | Valid with A/F switched. |
| GHFDIB | 24 (2%) | D/F switched – can’t arrive in specialist bed before being seen. |
| GDHFEIB | 24 (2%) | D/H and E/F switched |

It appears that there are some valid routes that aren’t in our pathway. For those who don’t get thrombolysed there are many people who arrive in a specialist stroke bed prior to their brain scan. Also there are many people who get “First Seen” before they arrive at the hospital. This seems nonsensical but could be valid if “First Seen” applied to GPs or ambulance staff. Finally there are patients who receive thrombolysis after getting to a specialist stroke bed which could also be a valid route. All other switches appear to be mistakes – for example having a brain scan prior to being first seen.

In order to determine how well each method works we must determine for each patient the most probable route taken. As our dataset is small we can do this manually by defining rules based on the data. We first assume that events that don’t happen are rarely inserted and then classify the patients according to the following rules:

  1. If a patient has thrombolysis or a follow up scan then we assume route GHDEAFCIB

  2. Of those remaining, for any with a brain scan we assume route GHDEFIB

  3. Of those remaining, for any with a stroke unit arrival or discharge we assume route GHDFIB

  4. Of those remaining we assume GHDB
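The four rules above can be sketched as a single function over a patient's journey string. The character meanings are assumptions inferred from the routes and the comments in Table 2, not stated explicitly in the text: A for thrombolysis, C for follow up scan, E for brain scan, F for stroke unit arrival and I for stroke unit discharge.

```python
def assumed_route(journey: str) -> str:
    """Most probable route for a journey, following the four manual rules in order."""
    # Rule 1: thrombolysis (assumed 'A') or a follow up scan (assumed 'C').
    if "A" in journey or "C" in journey:
        return "GHDEAFCIB"
    # Rule 2: a brain scan (assumed 'E').
    if "E" in journey:
        return "GHDEFIB"
    # Rule 3: stroke unit arrival (assumed 'F') or discharge (assumed 'I').
    if "F" in journey or "I" in journey:
        return "GHDFIB"
    # Rule 4: everything else.
    return "GHDB"

for journey in ("GHDFEIB", "GHDEFACIB", "GHFDIB", "GDHB"):
    print(journey, "->", assumed_route(journey))
```

Because the rules look only at which events are present, not at their order, a journey with switched events such as GHDFEIB is still assigned to GHDEFIB.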

In addition to returning the correct result it is also of use if the distance measure returns a unique result. There will be situations where this isn’t possible but in general string matching methods that return more unique results are preferable.

For each method, Table 3 gives the number of unique matches and the number of correct matches where a correct match is one that is both unique and matches with the routes we assume the patients actually followed.
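The tallying of unique and correct matches can be sketched as follows. The journeys, their hand-assigned routes and the distance function are toy values chosen to exercise the counting logic; in particular, the length-difference distance is a deliberately crude stand-in and not one of the metrics evaluated in the paper.

```python
from typing import Callable, Dict, List, Tuple

def tally(journeys: List[str], routes: List[str],
          distance: Callable[[str, str], float],
          assumed: Dict[str, str]) -> Tuple[int, int]:
    """Count journeys with a unique closest route, and how many of those are correct."""
    unique = correct = 0
    for journey in journeys:
        scores = {route: distance(route, journey) for route in routes}
        best = min(scores.values())
        winners = [route for route, score in scores.items() if score == best]
        if len(winners) == 1:
            unique += 1
            # A correct match must be unique and agree with the assumed route.
            if winners[0] == assumed[journey]:
                correct += 1
    return unique, correct

ROUTES = ["GHDEAFCIB", "GHDEFIB", "GHDFIB", "GHDB"]
# Toy journeys with hand-assigned routes (illustrative, not patient data).
ASSUMED = {"GHDFIB": "GHDFIB", "GHDEFIB": "GHDEFIB", "GHDFEIB": "GHDEFIB",
           "GHDEB": "GHDEFIB", "GHDEIB": "GHDEFIB"}

def length_gap(route: str, journey: str) -> int:
    """Crude stand-in distance: difference in string length only."""
    return abs(len(route) - len(journey))

unique, correct = tally(list(ASSUMED), ROUTES, length_gap, ASSUMED)
total = len(ASSUMED)
print(f"unique {unique}/{total}, correct {correct}/{unique} of unique, "
      f"{correct}/{total} overall")
```

The three printed ratios correspond to the three percentage columns reported for each method: unique matches over all patients, correct matches over unique matches, and correct matches over all patients.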

Table 3.

Number of unique and correct matches

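| Method | Unique matches (% of 1077 patients) | Correct matches (% of unique matches) | Correct (% of all patients) |
| --- | --- | --- | --- |
| LEV | 818 (75.95%) | 645 (78.85%) | 59.89% |
| DAM | 853 (79.20%) | 849 (99.53%) | 78.83% |
| LCS1 | 882 (81.89%) | 878 (99.55%) | 81.52% |
| LCS2 | 1077 (100.00%) | 1070 (99.35%) | 99.35% |
| NLEV | 1076 (99.91%) | 841 (78.16%) | 78.09% |
| NLD | 1076 (99.91%) | 841 (78.16%) | 78.09% |
| NDAM | 1076 (99.91%) | 1068 (99.26%) | 99.16% |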

The NLEV and NLD methods produce the same results as predicted. The ratio of correct matches to unique matches shows that the Damerau-Levenshtein and the longest common subsequence methods work excellently with >99% correct, whereas the Levenshtein variants only achieve 78–79%. It can also be seen that normalized methods are better at producing unique matches with LCS2 matching all pathways uniquely, while NLEV, NLD and NDAM only fail to give a unique answer for a single patient – actually a different patient for each method. Examining the difference between NLEV and NDAM shows that NDAM is correctly identifying pathways where events have been recorded out of sequence. As an example the patient record of GHDFEIB is correctly matched to GHDEFIB by NDAM, while NLEV matches it to GHDFIB.

When the values for unique correct matches are combined the normalized Damerau-Levenshtein and the second Longest Common Subsequence methods are best, correctly matching >99% of the patient pathways.

For these two methods we can split the pathways into two groups: correct and incorrect matches, where a correct match is when the algorithm uniquely identifies the route the patient traversed through the pathway. We then compare the groups under the null hypothesis that the mean ‘string’ distance between them is equal. The density plots in Figure 3 demonstrate that the data we want to contrast are not drawn from normal or symmetrical distributions; indeed the distributions of string distances are quite different for matches compared with non-matches. Thus we make the contrast with a non-parametric (Mann-Whitney) method26, demonstrating statistically highly significant differences for both NDAM (P < 0.0001) and LCS2 (P < 0.0001) metrics.

Figure 3:

Box (top) and density (bottom) plots of string distances for matches (YES) and non-matches (NO) for NDAM and LCS2 metrics.

Finally, we compare NDAM, LCS2 and NLEV string distance metrics with regard to their classification accuracy for our care pathway journeys. Figure 4 shows the ROC curves for each metric with our test dataset, and the 95% confidence intervals for the areas under the curves: the more detailed comparison of the two most accurate metrics (NDAM and LCS2) is the Mann-Whitney result above.
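The link between the Mann-Whitney statistic and the area under a ROC curve can be sketched in a few lines. The paper's analysis was done in R; this pure-Python version is only an illustration, the distance values are made up, and no P value or confidence interval is computed.

```python
from typing import List

def mann_whitney_u(matches: List[float], non_matches: List[float]) -> float:
    """Count pairs where the non-match distance exceeds the match distance.

    Tied pairs contribute one half, as in the usual definition of the U statistic.
    """
    u = 0.0
    for m in matches:
        for n in non_matches:
            if n > m:
                u += 1.0
            elif n == m:
                u += 0.5
    return u

def auc(matches: List[float], non_matches: List[float]) -> float:
    """Area under the ROC curve, which equals U divided by the number of pairs."""
    return mann_whitney_u(matches, non_matches) / (len(matches) * len(non_matches))

# Made-up string distances for illustration only.
match_distances = [0.0, 0.07, 0.07, 0.1]
non_match_distances = [0.07, 0.2, 0.3]
print(mann_whitney_u(match_distances, non_match_distances))  # 10.0
print(round(auc(match_distances, non_match_distances), 3))   # 0.833
```

An area of 1 would mean every incorrect match lies farther from its route than every correct match; 0.5 would mean the distance carries no information about correctness.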

Figure 4:

Receiver Operating Characteristic (ROC) curves for NDAM, NLEV and LCS2 string distance metrics with 95% confidence intervals for the areas under the curves.

Discussion

Distance Weighting

The operations in the Damerau-Levenshtein string metric can be weighted. Given the nature of our dataset it is more likely that records were omitted or out of order than miscoded. If we are sure of this we can change the weighting of the operations accordingly, an option that is possible with the NDAM and not the LCS2 method. By doubling the weight associated with deleting a character, therefore making it less likely that matches will feature deletions, of the 1077 patients we obtain 1077 unique matches of which 1074 are correct. Weighted NDAM then becomes the most accurate way of predicting a patient’s route.

Generalization

The string matching process described here operates on a graph based representation of a care pathway. Therefore the methodology is theoretically applicable, although untested, to any process or workflow that can be represented as a graph, in healthcare and beyond.

Future work

There are several factors unstudied in this paper that will affect the overall success of the method. The size and shape of the graph are factors, as is the quality of the data. Further work is needed to determine which graph shapes work well with this method. Finally, the next stage of our work is to determine how the distance a patient is from their care pathway predicts their outcomes.

Conclusion

String matching would seem to be a highly successful way to determine which route a patient followed in a care pathway. Normalized distance functions should be used to ensure high numbers of unique matches. For clinical data where the chance of events occurring, or being recorded, in the wrong order is high, the Damerau-Levenshtein or Longest Common Subsequence methods should be used in preference to the Levenshtein distance.

Acknowledgments

Funded by the National Institute for Health Research Greater Manchester Primary Care Patient Safety Translational Research Centre (NIHR GM PSTRC). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

References


Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association