Journal of the American Medical Informatics Association: JAMIA. 2005 Nov-Dec;12(6):630–641. doi: 10.1197/jamia.M1714

Use of Graph Theory to Identify Patterns of Deprivation and High Morbidity and Mortality in Public Health Data Sets

Peter A Bath 1, Cheryl Craigs 1, Ravi Maheswaran 1, John Raymond 1, Peter Willett 1
PMCID: PMC1294034  PMID: 16049232

Abstract

Objective: An important part of public health is identifying patterns of poor health and deprivation. Specific patterns of poor health may be associated with features of the geographic environment where contamination or pollution may be occurring. For example, there may be clusters of poor health surrounding nuclear power stations, whereas major roads or rivers may be associated with areas of poor health alongside the feature in chains. Current methods are limited in their capacity to search for complex patterns in geographic data sets. The objective of this study was to determine whether graph theory could be used to identify patterns of geographic areas that have high levels of deprivation, morbidity, and mortality in a public health database. The geographic areas used in the study were enumeration districts (EDs), which are the lowest level of census geography in England and Wales, representing on average 200 households in the 1991 census. More specifically, the study aimed to identify chains of EDs with high deprivation, morbidity, and mortality that might be adjacent to specific types of geographic features, i.e., rivers or major roads.

Design: The maximum common subgraph (MCS) algorithm was used to search for seven query patterns of deprivation and poor health within the Trent region. Query pattern 1 represented a linear chain of five EDs and query patterns 2 to 7 represented the possible clusters of the five EDs. To identify chains of EDs with high deprivation, morbidity, and mortality, the results from the query patterns 2 to 7 were used to remove patterns (option 1) and EDs (option 2) from the results of query pattern 1.

Measurements: Data on the Townsend Material Deprivation Index, standardized long-term limiting illness and standardized all-cause mortality rates were used for the 10,665 EDs within the Trent region.

Results: The MCS algorithm retrieved a range of patterns and EDs from the database for the queries. Query pattern 1 identified 3,848 patterns containing a total of 195 EDs. When the patterns retrieved using query patterns 2 to 7 were removed from the 3,848 patterns using option 1, 704 patterns remained containing 161 EDs. When the EDs retrieved using query patterns 2 to 7 were removed from the 195 EDs identified by query pattern 1 using option 2, 12 EDs remained. The MCS algorithm was therefore able to reduce the numbers of patterns and EDs enough to allow manual examination for chains of EDs and for geographic features that might be associated with them.

Conclusion: The study demonstrates the potential of the MCS algorithm for searching for specific patterns of need. This method has potential for identifying such patterns in relation to local geographic features for public health.

Background

Identifying disease clusters, e.g., clusters of childhood leukemia and outbreaks of communicable diseases, is a major issue in public health medicine. Similarly, identifying multidimensional patterns within large databases, e.g., for analyzing use of health services in relation to needs and for examining socioeconomic differentials in health, forms an important part of public health surveillance. Although a large amount of research has been directed at the identification of geographic disease clusters1 and different methods have been developed by Openshaw et al.,2 Knox,3 Besag and Newell,4 and Kulldorff,5 the currently available methods are limited in their capacity for searching for complex patterns. Part of the problem is that although data are available to represent levels of morbidity, service use, mortality, and deprivation (i.e., having inadequate housing, resources, and/or employment) within geographic areas, identifying associations between geographic units in relation to time and space is computationally complex. The study described in this paper has adapted techniques from the field of computational chemistry to search for patterns of interest in public health.6

In computational chemistry, sophisticated methods based upon graph theory have been developed for storing and retrieving various types of chemical information efficiently, including two- (2D) and three-dimensional (3D) chemical structures.7,8 Highly specified, sophisticated, and flexible searches can be carried out using computationally tractable search algorithms within large databases of molecular structures. These methods have been successfully adapted and validated for use in searching for patterns of deprivation and mortality,6 and the resulting program has been used for identifying a variety of specific patterns of socioeconomic deprivation and morbidity.9 Using a program such as this to search for specific shapes of patterns of deprivation and poor health is of particular value within public health when used to identify patterns in relation to the geographic environment. For example, there may be clusters of areas of high morbidity around and in close proximity to discrete geographic features, e.g., a nuclear power station or landfill site, where potentially harmful substances are processed or produced and that may be released into the environment. Such health risks may be increased in deprived areas, where people living in inadequate housing and overcrowded conditions have poorer health, and are more susceptible to environmental influences. Identification of clusters of high morbidity and/or mortality in deprived areas may help to identify hazardous features and help public health specialists to highlight local public health problems. Living in close proximity to other geographic features, e.g., major roads or rivers, may also be potentially hazardous due, for example, to air pollution from traffic or release of effluents into a water course. 
However, such features are not discrete, but may be present over some distance and may appear as linear chains of areas of high morbidity, exacerbated by high levels of deprivation, along the length of the feature rather than surrounding it in clusters. Identifying chains of high morbidity and deprivation may help to identify different types of potentially hazardous geographic features.

While the work of Bath et al.9 showed that it is computationally possible, if expensive, to identify quite large clusters of deprivation and morbidity, identifying areas that occur only in chains is potentially complicated: clusters consist of chains that happen to have greater connectivity than isolated chains, so separating chains from clusters is computationally a nontrivial task. This paper describes new work to extend the approach and tackle this problem.

The aim of the study reported here, therefore, was to determine whether our graph-based program, called RASCAL,10 could be applied to the identification of chains of geographic areas all with similar demographic and health attributes, i.e., high levels of deprivation, high levels of long-term limiting illness, and high levels of mortality. In this paper, we describe in detail the maximum common subgraph (MCS) algorithm that lies at the heart of RASCAL and describe and evaluate our attempts to identify particular chains of geographic areas with similar demographic and health attributes. The purpose of the paper is to discuss the potential benefits of this approach, the current disadvantages of using the methods, and the need for future work to overcome the limitations.

Graph Theory

Graph-theoretical methods are widely used for the representation and searching of 2D and 3D chemical structures. Graph theory is used to describe a set of objects, termed nodes or vertices, and the relationships, termed edges, between the nodes. The graphs for two objects are illustrated in Figure 1.

Figure 1.


Graph terminology. The nodes in the graph, I, are labeled a, b, c, d, and e, with a and b being an example of a pair of nodes that are adjacent to each other. The node set {c,d,e} in I comprises a subgraph of the graph, and this is isomorphic to the subgraph {p,q,s} in graph II. The node set {c,d,e} in I is an example of a clique since all the nodes are connected to each other; it also represents the maximum common subgraph since it is the largest set of nodes and associated edges that matches a corresponding subgraph (either {p,q,r} or {p,q,s}) in II.

In chemical structures, the nodes represent the atoms, and the edges represent the bonds (2D) or distances (3D) between atoms. The resulting graph, g, or connection table, contains a list of all the (nonhydrogen) atoms within the structure and their relationships to each other, in terms of bonds (2D) or distances (3D). In addition to information about the relationships between atoms in a molecule, information about the atom itself can be stored, e.g., atom type, electrostatic charge. In earlier work, we described the application of graph theory to the representation of geospatial data used in the context of public health, in which the nodes represented geographic areas called enumeration districts (EDs), the edges represented whether two EDs were adjacent to each other, and information on deprivation and morbidity was stored for each ED.6,9 This is illustrated in Figure 2, which shows the graph for ED 05CGGD10 superimposed on the map for part of the Trent region in England. ED 05CGGD10 is adjacent to EDs 05CGGD06, 05CGGD09, 05CGGD11, 05CGGD12, and 05CGGD13 and an edge is shown between the node for ED 05CGGD10 and each of the nodes for these EDs. Conversely, ED 05CGGD10 is not adjacent to the other EDs in the map and so no edge is shown between the node for this ED and the nonadjacent EDs.

Figure 2.


Map of part of the Trent region with the graph for enumeration district (ED) 05CGGD10 superimposed. The nodes (black dots) represent EDs and the edges (black lines) represent EDs that are adjacent to ED 05CGGD10.

An important characteristic of a connection table is that it can be regarded as a graph, a mathematical construct that describes a set of objects, called nodes, and the relationships, called edges, that exist between pairs of the objects. Following McGregor,11 we define a graph, G, as consisting of a set of nodes V, together with a set of edges E connecting pairs of nodes (E ⊆ V × V). Two nodes are referred to as adjacent if they are connected by an edge, as shown in Figure 1. A labeled graph is one in which labels are associated with the nodes and/or edges; thus, a 2D connection table can be considered a labeled graph since the atoms and bonds correspond to the nodes and edges of a graph. A subgraph of G is a subset, P, of the nodes of G together with a subset, F, of the edges connecting pairs of nodes in P (P ⊆ V and F ⊆ P × P). Two graphs, A and B, are said to be isomorphic if they have the same structure, i.e., if there is a correspondence or mapping between the nodes of A and of B such that adjacent pairs of nodes in A are mapped to adjacent pairs of nodes in B. A common subgraph of two graphs A and B consists of a subgraph a of A and a subgraph b of B such that a is isomorphic to b: the MCS is the largest such common subgraph.
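The definitions above can be made concrete with a small sketch. The two graphs below are hypothetical (they are not the graphs of Figure 1), the function names are illustrative, and the brute-force isomorphism test is only practical for very small graphs.

```python
from itertools import permutations

# Hypothetical graph G: a node set V and an edge set E of unordered pairs.
V = {"a", "b", "c", "d"}
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]}

def adjacent(u, v, edges):
    """Two nodes are adjacent if an edge connects them."""
    return frozenset((u, v)) in edges

def induced_subgraph(nodes, edges):
    """Edges of the parent graph that join two nodes of the chosen subset."""
    return {e for e in edges if e <= set(nodes)}

def is_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute force: try every one-to-one mapping of A's nodes onto B's nodes."""
    nodes_a, nodes_b = sorted(nodes_a), sorted(nodes_b)
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        mapped = {frozenset(mapping[x] for x in e) for e in edges_a}
        if mapped == edges_b:
            return True
    return False

# The subgraph on {b, c, d} is a triangle, so it matches any other triangle.
triangle = induced_subgraph({"b", "c", "d"}, E)
other_nodes = {"p", "q", "r"}
other_edges = {frozenset(p) for p in [("p", "q"), ("q", "r"), ("p", "r")]}
print(adjacent("a", "b", E))
print(is_isomorphic({"b", "c", "d"}, triangle, other_nodes, other_edges))
```

Storing edges as unordered pairs reflects the undirected adjacency relation used throughout the paper.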

The equivalence between a labeled graph and a connection table means that connection tables may be processed using isomorphism algorithms, which are used to detect whether two graphs, or parts thereof, are identical. The most important types of algorithm in this context are for graph isomorphism, subgraph isomorphism, and MCS isomorphism. Graph-theoretical techniques have been used in chemical information science for the processing of 2D chemical structure diagrams and the storage and retrieval of 3D molecules.7,8

MCS Isomorphism Algorithm

An MCS algorithm allows one to determine the largest subgraph that is common to a pair of graphs. This process is extremely demanding of computational resources and belongs to the class of NP-complete computational problems; a range of types of MCS algorithm has therefore been developed to accomplish MCS detection as efficiently as possible.12 In the work reported here, we used an approach based on the graph-theoretical technique known as clique detection.

A clique is a subgraph of a graph in which every node is connected to every other node and is not contained in any larger subgraph with this property. The clique detection approach to the identification of MCSs involves the identification of cliques in a correspondence graph, a data structure that contains all the possible equivalences between the two graphs that are being compared.13 Specifically, given a pair of graphs A and B, a correspondence graph, C, can be formed by the following process:

  1. Create the set of all pairs of nodes, one from each of the two graphs, such that the nodes of each pair are of the same type.

  2. Form the graph C whose nodes are the pairs from step 1. Two correspondence graph nodes (A(I),B(X)) and (A(J),B(Y)) are connected in C if the values of the edges from A(I) to A(J) and B(X) to B(Y) are the same.

  3. Maximal common subgraphs then correspond to the cliques of the correspondence graph.

The correspondence graph for graphs I and II is shown in Figure 3.

Figure 3.


Generation of a correspondence graph for graphs I and II. The nodes of this graph are pairs of nodes, one from graph I and one from graph II, with only the top left-hand part shown for simplicity. Considering the pairs of nodes {a,p} and {b,q} as an example, the matrix element corresponding to the possible matching of these two correspondence graph nodes is set to 1 as there is an edge connecting a to b that is equivalent to the edge connecting p to q.

Thus, the identification of the MCS for a pair of molecules is equivalent to the identification of the largest clique in the correspondence graph linking together the two molecules. In fact, as discussed below, this algorithm may be used to identify all the subgraphs in common when a pair of graphs is matched (rather than just the largest such subgraph).
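The three-step construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RASCAL implementation: the two input graphs are hypothetical, all nodes are treated as being of the same type, and cliques are enumerated with the basic Bron-Kerbosch procedure (without the optimizations a production program would need).

```python
from itertools import combinations

def correspondence_graph(nodes_a, edges_a, nodes_b, edges_b, same_type):
    """Steps 1 and 2: pair compatible nodes, then link pairs whose edge values agree."""
    pairs = [(i, x) for i in nodes_a for x in nodes_b if same_type(i, x)]
    links = {p: set() for p in pairs}
    for (i, x), (j, y) in combinations(pairs, 2):
        if i == j or x == y:
            continue  # a node cannot be matched to two different partners
        if (frozenset((i, j)) in edges_a) == (frozenset((x, y)) in edges_b):
            links[(i, x)].add((j, y))
            links[(j, y)].add((i, x))
    return links

def maximal_cliques(links):
    """Step 3: enumerate maximal cliques (Bron-Kerbosch, no pivoting)."""
    found = []
    def expand(r, p, x):
        if not p and not x:
            found.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & links[v], x & links[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(links), set())
    return found

# Hypothetical inputs: graph 1 contains a triangle c-d-e; graph 2 is a triangle.
nodes_1 = ["a", "b", "c", "d", "e"]
edges_1 = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("c", "e")]}
nodes_2 = ["p", "q", "r"]
edges_2 = {frozenset(p) for p in [("p", "q"), ("q", "r"), ("p", "r")]}

links = correspondence_graph(nodes_1, edges_1, nodes_2, edges_2, lambda i, x: True)
best = max(maximal_cliques(links), key=len)
print(sorted(best))  # one of several equivalent matchings of c, d, e onto p, q, r
```

Returning every maximal clique, rather than only the largest, mirrors the point made above that the approach can report all subgraphs in common.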

It should be noted that graph theory is a completely general technology, in which the matching algorithms can be applied to any sort of data given an appropriate graph representation: We adapted the connection table to represent geographic areas called EDs. Enumeration districts are the lowest level of census geography in England and Wales representing on average 200 households in the 1991 census. The use of EDs is important because these are the smallest geographic units for which aggregated census data and public health data are available in England and Wales. In this study, EDs were represented by nodes and whether they are adjacent to other EDs was represented by the edges. Information relating to the EDs, i.e., on deprivation, morbidity, and mortality, was stored and used in the searching process to identify patterns of EDs with high levels of deprivation, morbidity, and mortality using the methods described in the following section. Our main research question was as follows: can graph theory be used to identify chains of geographic areas with high levels of deprivation, morbidity, and mortality?

Methods

Introduction

Data from the 10,665 EDs for the Trent region of England were used in this study. Geographic information systems (GISs) were used to construct databases of these basic building blocks, and the techniques outlined above were used to store and retrieve this information. A file (equivalent to a 2D chemical structure) contains a connection table in which the nodes represent the EDs (analogous to atoms) and the edges represent adjacent EDs (i.e., EDs that share a common boundary) (analogous to bonds).

Use of the RASCAL Program

The MCS isomorphism program called RASCAL (RApid Similarity CALculator)10,14 was adapted to allow it to be used with geographically based public health data in which nodes are geographic areas and the edges are the association between these areas. The modified RASCAL program requires two distinct pieces of information about each geographic area: variable information that will be used in the selection criteria and information about which areas are neighboring. RASCAL was originally designed to match pairs of graphs representing small molecules, such as drugs or pesticides that contain a few tens of nodes (atoms) and edges (bonds). However, in the current study, we used much smaller graphs, representing areas of deprivation and poor health, to match with a relatively large graph containing over 10,000 nodes, one for each of the EDs within the Trent region.

Data

Geographic Area

The geographic area used in the study was the area previously covered by the Trent Regional Health Authority, which includes South Yorkshire, Derbyshire, Leicestershire, Nottinghamshire, Lincolnshire, and South Humberside. This area is shown in Figure 4. The areas of interest were the 10,665 EDs that make up the Trent region for which census data were available. Enumeration districts are the lowest level of census geography in England and Wales, representing on average 200 households in the 1991 census. Therefore, rural EDs, with relatively low population density, are quite large and clearly visible in Lincolnshire in Figure 4. In contrast, urban EDs, with relatively high population density, are very small and the large cities in the Trent region, e.g., Sheffield, Leicester, and Nottingham, show up as black areas due to the large numbers of very small EDs.

Figure 4.


Map of Trent region showing the enumeration districts for the 1991 census.

Information about three variables was used in the study: the Townsend material deprivation index15; all-age, directly standardized long-term limiting illness per 1,000 population (DSLTLI); and all-age, all-cause, directly standardized mortality rates per 1,000 population (DSM).

Deprivation

The Townsend Material Deprivation Index was calculated for each ED within the Trent region.15 The Index is a composite score made up of the summation of four standardized variables taken from the 1991 Census Small Area Statistics (SAS). The Census variables are unemployment, overcrowding, lack of owner-occupied accommodation, and lack of car ownership. This index was chosen because previous studies have suggested that it is a reasonable measure for explaining material disadvantage.16 A high positive score indicates relatively high levels of deprivation within an area, whereas a high negative score indicates relatively high levels of affluence within an area. The Townsend Material Deprivation Index was calculated for each ED within Trent, standardized to the Trent region. A total of 195 EDs could not be allocated a deprivation score because of missing values in one or more of the census variables, generally because of low counts and suppression thresholds built into the census tables.17 These EDs were given a value of 99999 to indicate that the data were missing and were excluded from the analyses.
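As a rough illustration of a composite score built by summing standardized variables, the sketch below z-scores four indicator columns and adds them, assigning the sentinel 99999 to any ED with a suppressed value. The input figures are invented, the function names are hypothetical, and any variable transformations applied before standardization in the published index are deliberately left out, so this is not a faithful Townsend calculation.

```python
from statistics import mean, pstdev

MISSING = 99999  # sentinel used in the study for EDs without a score

def z_scores(values):
    """Standardize one indicator; assumes the values are not all identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def composite_scores(rows):
    """rows: four indicator percentages per ED; None marks a suppressed value."""
    complete = [r for r in rows if None not in r]
    columns = [z_scores(col) for col in zip(*complete)]
    totals = iter([sum(parts) for parts in zip(*columns)])
    return [MISSING if None in r else next(totals) for r in rows]

# Invented percentages: unemployment, overcrowding, non-owner-occupied, no car.
rows = [
    [10, 5, 40, 30],
    [20, 10, 60, 50],
    [None, 3, 20, 10],  # suppressed count, so no score can be computed
    [30, 15, 80, 70],
]
scores = composite_scores(rows)
print(scores)
```

EDs carrying the sentinel are then dropped before any query is run, as described above.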

DSLTLI

Data on long-term limiting illness were also taken from the 1991 Census SAS. The direct standardization method was used, standardizing by age using Trent region as the standard population. The ED-based population estimates used in the standardization were taken from the Estimating with Confidence Project, which adjusted for the underenumeration that occurred in the 1991 census.18

All-Age, All-Cause DSM Rates per 1,000 Population

The Office for National Statistics (ONS) records of deaths for the years 1994 to 1998 inclusive were used, together with five-year population estimates for the years 1994 to 1998 inclusive, to calculate all-cause, all-age DSM rates per 1,000 population, for each of the EDs within the Trent region. The Trent region was used as the standard population, standardizing by age and sex.
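Direct standardization weights each stratum-specific rate in an area by the standard population's count in that stratum. The sketch below shows the arithmetic for a single ED; the strata, counts, and function name are invented for illustration and are not taken from the study data.

```python
def directly_standardized_rate(events, population, standard_population, per=1000):
    """Weight each stratum's observed rate by the standard population in that stratum."""
    total_standard = sum(standard_population.values())
    expected = 0.0
    for stratum, std_pop in standard_population.items():
        if population.get(stratum, 0) > 0:
            expected += events.get(stratum, 0) / population[stratum] * std_pop
    return per * expected / total_standard

# Invented figures for one ED, with the region as the standard population.
deaths = {"under 65": 2, "65 and over": 8}
ed_population = {"under 65": 800, "65 and over": 200}
region_population = {"under 65": 8000, "65 and over": 2000}

dsm = directly_standardized_rate(deaths, ed_population, region_population)
print(round(dsm, 3))  # 10.0, because this ED has the same age structure as the standard
```

In practice the strata would be age bands for long-term limiting illness and age-by-sex bands for mortality, matching the standardization described in the text.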

Adjacency Information

As well as each ED having a value for each of the indicators above, each ED also has information about its neighboring EDs. The EDs were each assigned a number between 1 and 10,665. For each ED, a list of neighboring ED numbers was recorded. For example, the ED code 38PMFF03 was numbered 10,000 and was adjacent to the EDs numbered 9,998, 9,999, 10,001, 10,002, 10,003, and 10,004. This information was used by the MCS algorithm to generate the connection table. A detailed description of how these data were stored for the EDs is provided elsewhere.9
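A minimal sketch of how neighbor lists of this kind can be turned into a symmetric connection table is shown below. The single entry reuses the example from the text (ED number 10,000); the function names and the dictionary layout are assumptions, not the file format used by the modified RASCAL program.

```python
# Neighbor list for the ED numbered 10,000, as given in the example above.
neighbours = {10000: [9998, 9999, 10001, 10002, 10003, 10004]}

def build_edges(neighbour_lists):
    """Turn per-ED neighbor lists into a set of undirected edges."""
    edges = set()
    for ed, adjacent_eds in neighbour_lists.items():
        for other in adjacent_eds:
            edges.add(frozenset((ed, other)))
    return edges

def connection_table(edges):
    """Map every ED to the set of EDs it shares a boundary with."""
    table = {}
    for edge in edges:
        a, b = tuple(edge)
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table

table = connection_table(build_edges(neighbours))
print(sorted(table[10000]))
print(table[9998])
```

Building the table from undirected edges guarantees that adjacency is symmetric even if a neighbor list is recorded for only one of the two EDs.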

Queries

Query Patterns

Figure 5 shows the query patterns that were used to search for chains of five EDs within the Trent region database. The purpose of these query patterns was to search for patterns of EDs in which the first ED (Q1) was adjacent to the second ED (Q2), which was adjacent to the third ED (Q3), which was adjacent to the fourth ED (Q4), which was adjacent to the fifth ED (Q5), and in which all the EDs had values for deprivation, DSLTLI, and DSM that were in the highest 25%. Thus, for each of the seven query patterns, the selection criteria to identify EDs in the highest 25% for each of the query nodes for all three selection criteria were as follows: the Townsend Deprivation score was ≥1.95, the DSLTLI level was ≥161.03, and the DSM rate was ≥13.483.
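The node-level selection criteria amount to a conjunction of three threshold tests. The sketch below applies the cut-offs quoted above to a few invented EDs; the ED codes, field names, and values are hypothetical.

```python
# Cut-offs quoted in the text for the highest 25% on each indicator.
THRESHOLDS = {"townsend": 1.95, "dsltli": 161.03, "dsm": 13.483}
MISSING = 99999  # EDs without a deprivation score are excluded

def qualifies(ed):
    """True if the ED meets all three selection criteria."""
    if ed["townsend"] == MISSING:
        return False
    return all(ed[name] >= cutoff for name, cutoff in THRESHOLDS.items())

# Invented example EDs.
eds = {
    "ED-A": {"townsend": 4.2, "dsltli": 180.0, "dsm": 15.1},
    "ED-B": {"townsend": 1.95, "dsltli": 161.03, "dsm": 13.483},
    "ED-C": {"townsend": 5.0, "dsltli": 150.0, "dsm": 20.0},
    "ED-D": {"townsend": MISSING, "dsltli": 200.0, "dsm": 18.0},
}
selected = sorted(code for code, ed in eds.items() if qualifies(ed))
print(selected)  # ['ED-A', 'ED-B']
```

Only EDs passing this filter can be matched to a query node, which is why every ED in a retrieved pattern is in the highest 25% on all three indicators.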

Figure 5.


Query patterns used to search the Trent region database. Please note that query patterns 2 and 7 and query patterns 3 and 6 are mirror images of each other, and because the characteristics of all the nodes are the same, these pairs of patterns are topologically identical.

Query 1 is the main query pattern and was used to identify all EDs within Trent that form part of at least one chain of five EDs all within the 25% highest scores for all three indicators. One problem with the MCS algorithm is that when searching for Query 1, it not only identifies all EDs that form a linear chain of five EDs, Q1-Q2-Q3-Q4-Q5, with no additional connectivity (i.e., Q1 is not adjacent to Q3, Q4, or Q5; Q2 is not adjacent to Q4 or Q5; Q3 is not adjacent to Q1 or Q5; Q4 is not adjacent to Q1 or Q2; Q5 is not adjacent to Q1, Q2, or Q3), but it also identifies chains of five EDs, Q1-Q2-Q3-Q4-Q5, in which one or more of these additional connections, or adjacencies, do occur, e.g., Q1 adjacent to Q3. Such a pattern would not be a linear chain of EDs but a cluster of EDs, as can be seen from the diagram of Query 2 in Figure 5. Queries 2 to 7 represent all possible patterns containing at least one additional connection between EDs Q1 to Q5 and form subsets of the set of patterns identified by Query 1. What we did, therefore, was to identify all the patterns and EDs retrieved using Query 1 (set 1), then identified all the patterns and EDs retrieved using Queries 2 to 7 (sets 2 to 7), and removed either the patterns (option 1, see Results below) or EDs (option 2) retrieved in sets 2 to 7 from set 1. Removing sets 2 to 7 from set 1 left patterns in which there was a linear chain of five EDs, Q1-Q2-Q3-Q4-Q5, with no additional connectivity as described above.
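The distinction between a linear chain and a cluster can be illustrated by enumerating five-node paths directly and checking for extra adjacencies. This is a plain path search on a hypothetical adjacency table of qualifying EDs, not the MCS procedure used by RASCAL, and it would not scale to the queries described in earlier work.

```python
def five_ed_paths(adj):
    """All simple paths Q1-Q2-Q3-Q4-Q5, each reported once rather than once per direction."""
    paths = []
    def extend(path):
        if len(path) == 5:
            if path[0] < path[-1]:
                paths.append(tuple(path))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])
    for start in adj:
        extend([start])
    return paths

def extra_adjacencies(path, adj):
    """Pairs of EDs that are adjacent without being consecutive in the path."""
    return [(path[i], path[j])
            for i in range(5) for j in range(i + 2, 5)
            if path[j] in adj[path[i]]]

# Hypothetical qualifying EDs: 1-5 form a bare chain; 6-10 also have the edge 6-8.
adj = {
    1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4},
    6: {7, 8}, 7: {6, 8}, 8: {6, 7, 9}, 9: {8, 10}, 10: {9},
}
paths = five_ed_paths(adj)
chains = [p for p in paths if not extra_adjacencies(p, adj)]
clusters = [p for p in paths if extra_adjacencies(p, adj)]
print(chains)    # [(1, 2, 3, 4, 5)]
print(clusters)  # two paths through EDs 6-10, each with one extra adjacency
```

Every path is a match for Query 1, but only those with no extra adjacency correspond to the linear chains that remain after the results of Queries 2 to 7 are removed.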

As can be seen from Figure 5, query patterns 2 and 7 and query patterns 3 and 6 are mirror images of each other and because they contain the same node characteristics, they are topologically identical and therefore retrieved the same patterns of EDs. Of course, if the characteristics of the query nodes were different, then different patterns would have been retrieved by query patterns 2 and 7 and query patterns 3 and 6.

The modified RASCAL program was used to search for each of these queries in the Trent region data file.

Results

Table 1 shows the number of patterns identified by each of the query patterns used to search the Trent region database. It can be seen from Table 1 that the number of patterns and EDs ranged from 540 and 63, respectively (query pattern 4), to 3,848 and 195 (query pattern 1, of which all the patterns identified from the other query patterns formed subsets). The number of EDs selected is much less than the number of patterns identified because of the clustering of EDs in larger areas of deprivation, so that within a given area, there may be many combinations, or patterns, containing the same few EDs.

Table 1.

Number of Patterns and EDs Identified by each of the Seven Query Patterns

Query Pattern No.    No. of Patterns Identified    No. of EDs Selected
1                    3,848                         195
2, 7                 1,492                         174
3, 6                 662                           115
4                    540                           63
5                    1,232                         151

EDs = Enumeration Districts.

We tried two methods of removing the results from query patterns 2 to 7 from the results for query pattern 1: option 1, which removes all the patterns that were identified using queries 2 to 7 (n = 1,492, 662, 540, 1,232, 662, and 1,492, respectively) from the list of patterns identified in query 1 (n = 3,848); and option 2, which removes all the EDs that were identified using queries 2 to 7 (n = 174, 115, 63, 151, 115, and 174, respectively) to produce a list of EDs that are identified within patterns identified using query 1 only, i.e., which form a chain of five EDs. When option 1 was used, there were 704 patterns remaining, which contained a total of 161 EDs. When option 2 was used, only 12 EDs remained; as is demonstrated in the examples given below, only five of these 12 EDs were located in a chain (Figure 6). The two options produced different results because of the overlap of different patterns in areas containing many deprived EDs: option 1 removed those EDs that are found in the patterns using queries 2 to 7 if, and only if, the EDs in those areas are not also part of a pattern that contains only the edges shown in query 1. Option 2, on the other hand, removed any EDs from the patterns selected using query 1 that were in any of the patterns selected using queries 2 to 7 and only retained those EDs that were found exclusively in query 1. This is illustrated in the examples described below.
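The difference between the two options comes down to whether the set difference is taken over patterns or over EDs. The toy patterns below are invented to show the effect: one isolated chain, one cluster pattern, and one linear pattern that overlaps the cluster. Representing a pattern as a tuple of ED labels is an assumption made for the sketch.

```python
def eds_in(patterns):
    """All EDs that appear in at least one pattern."""
    return set().union(*patterns) if patterns else set()

# Invented results. Query 1 retrieves every five-ED path; queries 2 to 7
# retrieve only those paths that also contain an extra adjacency.
isolated_chain = ("a", "b", "c", "d", "e")
cluster_pattern = ("p", "q", "r", "s", "t")    # e.g., p is also adjacent to r
overlapping_chain = ("q", "r", "s", "t", "u")  # linear, but shares EDs with the cluster

query1_patterns = {isolated_chain, cluster_pattern, overlapping_chain}
queries_2_to_7_patterns = {cluster_pattern}

# Option 1: remove patterns, then list the EDs in whatever patterns remain.
option1_patterns = query1_patterns - queries_2_to_7_patterns
option1_eds = eds_in(option1_patterns)

# Option 2: remove every ED that appears in any pattern from queries 2 to 7.
option2_eds = eds_in(query1_patterns) - eds_in(queries_2_to_7_patterns)

print(len(option1_patterns), sorted(option1_eds))
print(sorted(option2_eds))
```

In this toy case option 1 keeps ten EDs, including four that also sit inside the cluster, whereas option 2 keeps six, one of which ("u") no longer belongs to a complete chain. The same effect explains why some EDs remaining under option 2 are not located in a chain.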

Figure 6.


Maps showing enumeration districts (EDs) selected after running queries 1 to 7 and the remaining EDs for options 1 and 2 for area 1.

Figures 6 to 8 show close-ups of three areas (1, 2, and 3) within the Trent region, showing a selection of results from queries 1 to 7 and the EDs remaining when options 1 and 2 were used to remove query patterns and EDs, respectively, from the results from query 1. It should be noted that in each area, the highlighted areas often contain more than five EDs: This occurs because there might be several chains of five EDs that are contiguous or overlap with each other within an area.

Figure 7.


Maps showing enumeration districts (EDs) selected after running queries 1 to 7 and the remaining EDs for options 1 and 2 for area 2.

Figure 8.


Maps showing enumeration districts (EDs) selected after running queries 1 to 7 and the remaining EDs for options 1 and 2 for area 3.

It can be seen from Figure 6 that in area 1 of the Trent region, query 1 identified an area in the upper left quadrant that contained one chain of five EDs of high deprivation, morbidity, and mortality and an area in the lower right quadrant that contained several chains of five EDs of high deprivation, morbidity, and mortality. It can also be seen from Figure 6 that when queries 2 to 7 were used, they each identified the area of high deprivation, morbidity, and mortality in the lower right quadrant, but none of them identified the area of high deprivation, morbidity, and mortality in the upper left quadrant. However, when the patterns identified using queries 2 to 7 were removed using option 1, not only did the patterns within the area in the upper left quadrant remain, but so did the patterns within the lower right quadrant. This occurred because, although it is possible to identify patterns of queries 2 to 7 in the area in the lower right quadrant, it is also possible to identify patterns made of the nodes in query 1 that contain only the edges shown in query 1. When the EDs identified using queries 2 to 7 were removed using option 2, none of the EDs in the area of high deprivation, morbidity, and mortality in the lower right quadrant remained, and all the EDs in the area of high deprivation, morbidity, and mortality in the upper left quadrant remained.

It can be seen from Figure 7 that in area 2 of the Trent region, query 1 identified an area that contained several chains of five EDs of high deprivation, morbidity, and mortality. When queries 2 to 7 were used, they each identified the area of high deprivation, morbidity, and mortality, although different queries identified different chains within that area, as is evinced by the small variations in the EDs identified for these queries. When the patterns identified using queries 2 to 7 were removed using option 1, the patterns within the area remained, again because it is possible to identify patterns made of the nodes in query 1 that contain only the edges shown in query 1. When the EDs identified using queries 2 to 7 were removed using option 2, only one ED remained.

It can be seen from Figure 8 that in area 3 of the Trent region, query 1 identified several areas that contained one or more chains of five EDs of high deprivation, morbidity, and mortality. Queries 2, 3, 5, 6, and 7 each identified very different chains within these areas, although query 4 did not identify any chains in area 3. When the patterns identified using queries 2 to 7 were removed using option 1, almost all the patterns within area 3 remained due to the patterns made of the nodes in query 1 containing only the edges shown in query 1. When the EDs identified using queries 2 to 7 were removed using option 2, no EDs remained within area 3.

Discussion

The overall aim of this study was to use graph-theoretical techniques to search for geographic patterns of deprivation and poor health in a large public health database. In the work described in this paper, we were seeking to isolate patterns that consisted of chains of EDs from those that were located in clusters. This study demonstrates the capacity of the modified RASCAL program and the MCS algorithm to identify patterns of relatively high deprivation, morbidity, and mortality. Manually identifying such areas within a large geographic area containing such a large number of EDs by checking each and every individual area by hand would be extremely laborious and resource intensive, if indeed it were possible.9 When run on a Silicon Graphics parallel processor (2 × 225 MHz; 2 × 180 MHz), the RASCAL program typically took up to 30 seconds to search for the query patterns used in this particular study, although the processing time is considerably longer, i.e., hours, for the larger, more complex queries in previous work.9 Although the procedure is potentially computationally intensive, depending on the complexity of the query pattern, using the MCS algorithm for this task followed by manual inspection and interpretation of the output makes the identification of such patterns possible.

The identification of deprived areas with high prevalence of chronic illness and high mortality is of potential value within public health for allocation of extra resources for health and social care services to meet local needs and also to try to improve health and well-being in local populations. This information is also of value to municipal authorities to target resources for improving housing conditions and the living environment. While current methods may be able to identify areas with such characteristics, RASCAL is able to retrieve differently shaped patterns of EDs that could be associated with specific geographic features.

The MCS algorithm could be used to search for both general and more specific patterns of health and well-being in geographic databases assuming that the required data are available for geographic units and that data on the proximity of these units to each other are available. In this study, we used data on the adjacency of EDs in searching for patterns, based on 2D searching algorithms. Similar 3D searching algorithms from computational chemistry could be adapted to conduct searches based on physical distances between geographic units, e.g., distances between centroids.
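One way to picture a distance-based representation is to label edges with centroid-to-centroid distances instead of a simple adjacency flag. The sketch below is speculative and only illustrates the data structure: the coordinates, the ED labels, and the distance cut-off are all invented, and no distance-based search was carried out in the study.

```python
from itertools import combinations
from math import dist

# Invented centroid coordinates (arbitrary units) for three EDs.
centroids = {"ED-A": (0.0, 0.0), "ED-B": (3.0, 4.0), "ED-C": (10.0, 0.0)}

def distance_edges(points, max_distance):
    """Connect every pair of EDs whose centroids lie within max_distance of each other."""
    edges = {}
    for (a, pa), (b, pb) in combinations(points.items(), 2):
        d = dist(pa, pb)
        if d <= max_distance:
            edges[frozenset((a, b))] = d
    return edges

edges = distance_edges(centroids, max_distance=6.0)
print(edges)
```

A query pattern could then constrain edge labels to a distance range, in the same way that 3D chemical searches constrain interatomic distances.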

We were interested in identifying groups of deprived EDs with poor health. The attributes that were included in the queries involved only three variables, the Townsend Index, levels of long-term limiting illness, and mortality rates, and were set at the same levels for each node (ED) to identify the EDs with the highest rates for these attributes. These represent relatively simple patterns, both in terms of the number of attributes and the levels at which they were set, and do not make full use of the capacity of the MCS algorithm, which can search among as many as 20 attributes assigned to a node. The number of attributes could be increased to include other measures of deprivation, health, or quality of life, assuming such data are available. For example, instead of using the composite index of deprivation, i.e., the Townsend Index, the actual variables used to calculate the index, i.e., levels of unemployment, overcrowding, lack of owner-occupied accommodation and lack of car ownership, as well as those from other deprivation indices could be included. Including a broader range of indicators of deprivation might help to identify the most extremely deprived areas or those with more specific user-defined characteristics. Similarly, although using levels of long-term limiting illness and mortality rates was convenient because the data were easily available, mortality rates especially are a relatively crude indicator of population health, and using disease-specific prevalence or incidence data could be more useful for planning and developing health and social care services to meet the precise needs of local communities.
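The node-level matching described above, in which each query node carries the same three attribute levels, can be sketched as a simple threshold test. The Python fragment below is a hedged illustration only: the class `EDAttributes`, the function `node_matches`, the cut-off values, and the example EDs are all hypothetical and do not reproduce the levels or data used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EDAttributes:
    townsend_index: float         # composite deprivation score
    limiting_illness_rate: float  # proportion with long-term limiting illness
    mortality_ratio: float        # mortality measure (reference level = 100)


# Hypothetical cut-offs applied identically to every query node.
QUERY_THRESHOLDS = EDAttributes(
    townsend_index=5.0, limiting_illness_rate=0.20, mortality_ratio=120.0
)


def node_matches(ed, thresholds=QUERY_THRESHOLDS):
    """True if the ED meets or exceeds the query level on all three attributes."""
    return (
        ed.townsend_index >= thresholds.townsend_index
        and ed.limiting_illness_rate >= thresholds.limiting_illness_rate
        and ed.mortality_ratio >= thresholds.mortality_ratio
    )


# Hypothetical EDs; only those passing every threshold can be matched to a query node.
eds = {
    "ED1": EDAttributes(6.2, 0.25, 135.0),
    "ED2": EDAttributes(6.0, 0.10, 140.0),  # fails on limiting illness
    "ED3": EDAttributes(5.0, 0.20, 120.0),  # exactly at the cut-offs
    "ED4": EDAttributes(-1.0, 0.08, 90.0),
}
candidates = sorted(name for name, ed in eds.items() if node_matches(ed))
```

Adding further attributes, such as the individual components of the Townsend Index, would amount to adding fields and comparisons to this test.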

Although the queries used in this study appear relatively simple in structure, they are actually quite complex for searching as they involve several nodes with varying degrees of connectivity between them. While GISs that can search topology could possibly have identified the patterns, these would have required separate complex computer programs for each individual query. The advantage of our graph theory-based approach is that, although the program itself is very large and complex, it was relatively straightforward to adapt the program to the adjacency and attribute data in the Trent region database. Once this had been completed, all that was required was for the query pattern to be set up in a query file.9

While previous work has demonstrated that it is possible to identify clusters of EDs,9,19 we were particularly interested in determining whether we could identify chains of deprived EDs with high morbidity and mortality, which might have potential value in public health/planning for identifying problems associated with geographic features such as rivers or major roads. Although query 1 in this study appears as a chain of EDs, when searching for this query, the MCS algorithm will identify and retrieve groups of EDs with the attributes of the query nodes that have connections, e.g., between nodes 2 and 4, additional to those specified in the query, i.e., between nodes 1 and 2, nodes 2 and 3, nodes 3 and 4, and nodes 4 and 5. In effect, the algorithm is unable to differentiate between chains of EDs and those that are arranged in clusters. To overcome this problem, we tried two approaches that involved running searches for queries that had at least one connection in addition to those specified in query 1. These queries (2 to 7) would also identify series of EDs that had other connections in addition to those specified.
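The behaviour described above can be reproduced on a toy graph. The following Python sketch is a brute-force illustration, not the RASCAL program or the MCS algorithm: it enumerates every five-node sequence whose consecutive nodes are adjacent, and then checks separately whether any non-consecutive nodes are also adjacent. The ED labels, the adjacency data, and the function names are hypothetical.

```python
from itertools import permutations


def find_chain_matches(adjacency, length=5):
    """Return all node sequences in which consecutive nodes are adjacent.

    Extra adjacencies between non-consecutive nodes are NOT excluded,
    mirroring the way the chain query also retrieves clustered EDs.
    """
    matches = set()
    for seq in permutations(adjacency, length):
        if all(seq[i + 1] in adjacency[seq[i]] for i in range(length - 1)):
            # A chain and its reverse are the same pattern; keep one orientation.
            matches.add(min(seq, tuple(reversed(seq))))
    return matches


def has_extra_adjacency(adjacency, seq):
    """True if any two non-consecutive nodes in the sequence are adjacent."""
    return any(
        seq[j] in adjacency[seq[i]]
        for i in range(len(seq))
        for j in range(i + 2, len(seq))
    )


# Hypothetical adjacency data: A-B-C-D-E has an extra B-D link (positions 2 and 4),
# whereas P-Q-R-S-T is a pure chain.
adjacency = {
    "A": {"B"}, "B": {"A", "C", "D"}, "C": {"B", "D"},
    "D": {"B", "C", "E"}, "E": {"D"},
    "P": {"Q"}, "Q": {"P", "R"}, "R": {"Q", "S"},
    "S": {"R", "T"}, "T": {"S"},
}
matches = find_chain_matches(adjacency)
pure_chains = sorted(m for m in matches if not has_extra_adjacency(adjacency, m))
```

Both groups of EDs satisfy the chain query, but only P-Q-R-S-T remains once sequences with additional connections are excluded. Exhaustive enumeration of this kind is practical only for very small graphs.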

The results of these searches were then used to remove those patterns (option 1) and EDs (option 2) retrieved by queries 2 to 7. We then examined a selection of the remaining patterns and EDs to evaluate options 1 and 2, respectively, in qualitative terms. The results suggest that, using these options, it is possible to reduce the number of patterns and EDs retrieved quite considerably (from 3,848 to 704 patterns and from 195 to 12 EDs). As could be seen from the results, option 2 removed the large cluster in area 1 while retaining the chain; option 1 retained both the chain and cluster observable in area 2, while option 2 removed all but one ED; and option 1 retained observable chains within larger clusters in area 3, while option 2 removed all the EDs. This suggests that option 1 was useful in that it retained chains but also retained some clusters, and that option 2 removed clusters but also removed observable chains: only five of the remaining 12 EDs were located in a chain. Although these methods may have had limited success, used together, they might be useful in narrowing down the number of patterns retrieved by excluding some, but not all, clusters of EDs. These results suggest that option 1 would have a higher sensitivity than option 2, but also that it would have a higher number of false-positive EDs, i.e., EDs that were not exclusive to query pattern 1. However, because the areas of deprivation contain large numbers of deprived EDs and the retrieved patterns of EDs consequently overlap, it is not possible to quantify the sensitivity and specificity of the two options. This is because areas of deprivation tend to be larger than the census areas of approximately 200 households (i.e., EDs), so groups of deprived EDs appear together in clusters,6 which illustrates the complex problem of searching for specific patterns of deprivation. This means that the final sets of patterns should be viewed manually to identify chains of EDs and any geographic features associated with them. Manual viewing in this way is important in interpreting results; what this study shows is that it is possible to reduce the magnitude of this task to a manageable level using the RASCAL program. To summarize, option 1 was more useful in that it reduced the number of ED patterns that had to be examined manually to a manageable level, while still identifying chains of EDs, whereas option 2 removed too many EDs, and potential chains of EDs were no longer apparent. The next phase of this research is to incorporate information about actual geographic features to allow more specific searching and to help determine the presence of any features that may be associated with the different patterns of morbidity and deprivation.
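The difference between the two options can be illustrated with simple set operations. The Python sketch below rests on stated assumptions: each retrieved pattern is represented as a tuple of ED labels, and two patterns are treated as the same if they contain the same EDs. The function names and the toy results are hypothetical and are not taken from the study output.

```python
def option1_remove_patterns(query1_patterns, cluster_patterns):
    """Option 1: drop whole query 1 patterns that were also retrieved by queries 2 to 7."""
    cluster_sets = {frozenset(pattern) for pattern in cluster_patterns}
    return [p for p in query1_patterns if frozenset(p) not in cluster_sets]


def option2_remove_eds(query1_patterns, cluster_patterns):
    """Option 2: drop every ED that appears in any pattern retrieved by queries 2 to 7."""
    cluster_eds = {ed for pattern in cluster_patterns for ed in pattern}
    query1_eds = {ed for pattern in query1_patterns for ed in pattern}
    return query1_eds - cluster_eds


# Hypothetical search results: two overlapping patterns and one separate chain.
query1_patterns = [
    ("A", "B", "C", "D", "E"),
    ("B", "C", "D", "E", "F"),
    ("P", "Q", "R", "S", "T"),
]
# Hypothetical combined output of queries 2 to 7 (patterns with extra connections).
cluster_patterns = [("A", "B", "C", "D", "E")]

remaining_patterns = option1_remove_patterns(query1_patterns, cluster_patterns)
remaining_eds = option2_remove_eds(query1_patterns, cluster_patterns)
```

In this toy case option 1 keeps the overlapping pattern B-C-D-E-F, whereas option 2 leaves ED F on its own because its neighbours B to E were removed. This mirrors the observation that option 2 can remove EDs that belong to observable chains.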

An additional problem with the results in the current format is that all the retrieved patterns and EDs are presented together, with chains overlapping. This means that although some clusters were removed by option 1, other clusters remained because there were several chains present that were contiguous. Another possible way of overcoming the problem of differentiating between chains and clusters would be to present each retrieved pattern separately and to try to identify chains manually, as outlined above. While this would not be feasible for all the patterns retrieved using query 1, it might be possible following removal of the results of queries 2 to 7 using option 1. Differentiating between chains and clusters of deprivation/poor health is of potential value to public health planners for identifying different types of geographic features that may be associated with these problems, e.g., power stations or landfill sites (clusters) and polluted rivers or major roads (chains). Further research will seek to develop algorithms more tailored to searching for patterns of deprivation, morbidity, and mortality, to explore the possibility of incorporating information about local geographic features into the attributes for each node (i.e., ED) together with information on physical distances between EDs, and to use these as additional search criteria to make a more refined search.

Conclusion

The study has demonstrated the utility of an MCS algorithm for identifying areas of high deprivation, mortality, and morbidity. Although using additional searches to try to differentiate between discrete clusters of EDs and chains of EDs was of limited success, the study demonstrates the potential of the algorithm for identifying areas of need in relation to the geographic environment for public health planning. The paper discussed the drawbacks of the program in its current state and outlined the need for future work to develop this method further to help with the identification of patterns in public health data sets.

The authors acknowledge the Medical Research Council for funding this study under the Discipline-Hopping program. The authors thank Peter Fryers and Paul White for providing the data on the enumeration districts.

References

1. Alexander FE, Cuzick J. Methods for the assessment of disease clusters. In: Elliott P, Cuzick J, English D, Stern R, editors. Geographical and environmental epidemiology. Oxford: Oxford University Press, 1992, p. 238–50.
2. Openshaw S, Craft AW, Charlton H, Birch JM. Investigation of leukaemia clusters by use of a geographical analysis machine. Lancet. 1988;1:272–3.
3. Knox EG. Detection of clusters. In: Elliott P, editor. Methodology of enquiries into disease clustering. London: Small Area Health Statistics Unit, 1989.
4. Besag J, Newell J. The detection of clusters in rare diseases. J R Stat Soc Series A. 1991;154:143–55.
5. Kulldorff M. Spatial scan statistics: models, calculations and applications. In: Glaz J, Balakrishnan N, editors. Scan statistics and applications. Boston: Birkhauser, 1999, p. 303–22.
6. Bath PA, Craigs C, Maheswaran R, Raymond J, Willett P. Validation of graph-theoretical methods for pattern identification in public health datasets. Health Inform J. 2002;8:167–73.
7. Gasteiger J, Engel T, editors. Chemoinformatics. A textbook. Weinheim: Wiley-VCH, 2003.
8. Leach AR, Gillet VJ. An introduction to chemoinformatics. Dordrecht: Kluwer, 2003.
9. Bath PA, Craigs C, Maheswaran R, Raymond J, Willett P. Pattern identification in public health data sets: the potential offered by graph theory. Innovations in GIS. Taylor & Francis, Inc., Abingdon, UK. (in press).
10. Raymond JW, Gardiner EJ, Willett P. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. Comput J. 2002;45:631–44.
11. McGregor JJ. Backtrack search algorithms and the maximal common subgraph problem. Software Pract Exp. 1982;12:23–34.
12. Raymond JW, Willett P. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J Comput Aided Mol Design. 2002;16:521–33.
13. Levi G. A note on the derivation of maximal common subgraphs of two directed or undirected graphs. Calcolo. 1972;9:341–52.
14. Raymond JW, Gardiner EJ, Willett P. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J Chem Inform Comput Sci. 2002;42:305–16.
15. Townsend P, Phillimore P, Beattie A. Health and deprivation: inequality and the North. London: Croom Helm, 1988.
16. Morris R, Carstairs V. Which deprivation? A comparison of selected deprivation indexes. J Public Health Med. 1991;13:318–26.
17. Dale R, Marsh C. The 1991 Census user's guide. London: HMSO, 1993.
18. Simpson S, Tye R, Diamond I. What was the real population of local areas in 1991? Working paper 10. Estimating with Confidence Project. Southampton: Department of Social Sciences, University of Southampton, 1995.
19. Bath PA, Craigs C, Maheswaran R, Raymond J, Willett P. Use of graph theory for data mining in public health. Data Mining III. In: Zanasi A, Brebbia CA, Ebecken NF, Melli P, editors. Proceedings of the Third International Conference on Data Mining. Southampton: WIT Press, 2002, p. 819–28.

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
