Abstract
Secondary lymphedema (LE) is a chronic progressive disease often caused by cancer treatment, especially in patients who require surgical removal of or radiation to lymph nodes. While LE is incurable, it can be managed successfully with early detection and appropriate treatment. Detection and prediction of LE are difficult due to the absence of a “gold standard” for diagnosis. Despite this, management of the disease is accomplished through adherence to a set of guidelines developed by experts in the field. Unfortunately, not all the recommendations in such a document are supported by clear research evidence; many are based only on expert judgment with limited evidence. This paper focuses on developing a new algorithm to extract specific association rules from LE survey data and efficiently index the rules for easy knowledge retrieval, with the ultimate goal of discovering evidence-based and relevant knowledge for inclusion into the best practice (BP) document for the LE community.
Introduction
Recent statistical data show that breast cancer has the highest survival rate among all cancers at 89%1. There are an estimated 2.4 million breast cancer survivors in the US2. Lymphedema is a threat to this large population of breast cancer survivors, since their cancer treatment likely included surgery or radiation treatment. As these treatments may adversely affect the lymphatic system, breast cancer survivors are at lifetime risk of developing lymphedema (LE)3.
LE management has been significantly improved through the efforts of individual research institutions and organizations such as the American Lymphedema Framework Project (ALFP), a national initiative created to increase awareness of LE, develop consensus guidelines for best practice in LE management in the US, and design and build a national minimum dataset based on patients with LE. Even so, many patients are either unaware of available treatment modalities and centers or receive inadequate treatment. A recent review indicated a clear lack of evidence for standard LE management4. As with many diseases, the best chance of managing and controlling LE is through early detection and initiation of appropriate treatment. Once LE progresses past the early stages, whether due to a patient’s unawareness of or indifference to LE or as the result of undetected early signs and symptoms, it becomes increasingly difficult to control. For this reason, awareness and early detection are of utmost importance.
Even though there are many isolated electronic resources from LE organizations such as the American Cancer Society (ACS), National Cancer Institute (NCI), National Lymphedema Network (NLN), and the International Lymphoedema Framework (ILF), a comprehensive informatics tool to bind these resources together is missing. Such a tool could provide the LE community with the foundation for evidence-based, consensus-supported guidelines for LE management by using as much LE data as possible from different sites and sources. In order to predict LE more accurately, we need to explore as many hidden relationships as possible in LE data. Associative mining techniques can be applied to these data in an effort to discover interesting and clinically relevant findings to support, negate, or expand the BP document5, with the latter opening the door to further research for evidence-based practice in this area.
Even though several associative mining algorithms have been created, such as Frequent Pattern Tree (FPT)6 and Apriori7, they share some common problems: memory exhaustion with large datasets or a low minimum support threshold. Also, since most associative mining algorithms report only frequent itemsets, users must perform post-processing to extract association rules (ARs). To tackle these issues, we develop a novel approach to: 1) improve mining efficiency by using a graph structure; 2) generate a confidence graph that includes ARs automatically; and 3) match specific AR results directly based on different user queries.
Method
Dataset for Lymphedema Patients
Our symptom data come from the LE and Breast Cancer Questionnaire (LBCQ), which was developed, piloted, revised, and validated by Armer et al8. The LBCQ is a structured interview or self-report tool to assess a patient’s profile including lymphedema symptoms. For this study, we focus on a series of three yes/no questions that are asked for each of the 14 LE symptoms: Have you experienced this symptom recently (last 30 days)? In the past year (12 months)? Have you taken any action to manage this symptom? The binary nature of these questions makes them very suitable for analysis with data mining techniques.
This survey is administered as an interview by the nursing research staff at each lab visit (T0–T13) over a 60-month period, as shown in Figure 1. Our temporal dataset currently consists of 376 patients. Our reported findings use symptom data from the LBCQ only.
Figure 1.
Timeline for data collection, where Ti represents the ith visit of a patient9.
Preprocessing the Data
As usual, the data must be preprocessed before applying a data mining algorithm. Specifically, we assign a unique value to each answer for each question at each visit, for a total of 2 × 42 × 14 = 1,176 values: the product of the number of yes/no options (2), the number of questions per visit (42, i.e., 14 symptoms × 3 questions), and the number of visits (14). Each of these values is an item. Because only one answer is permitted per question, the data are screened to verify this for each patient. We recorded answers for all patients; however, if certain questions were not answered by a patient, the items associated with those questions were not considered in the data mining process.
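To make the encoding concrete, the following is a minimal Python sketch; the item_id layout and function names are our own illustrative assumptions, not part of the study’s actual pipeline.

# Minimal sketch of the item encoding described above (names are illustrative).
# Each (visit, question, answer) triple maps to one of 2*42*14 = 1,176 item ids.

N_QUESTIONS = 42  # 14 symptoms x 3 yes/no questions per visit
N_ANSWERS = 2     # yes/no

def item_id(visit: int, question: int, answer: int) -> int:
    """Map visit (0-13), question (0-41), answer (0=no, 1=yes) to a unique item id."""
    return (visit * N_QUESTIONS + question) * N_ANSWERS + answer

def encode_record(responses):
    """Turn one patient's answered questions into a sorted list of items.
    Unanswered questions are simply absent, matching the handling described above."""
    return sorted(item_id(v, q, a) for (v, q, a) in responses if a is not None)

# Example: 'yes' to question 0 at T0 and 'no' to question 3 at T1.
print(encode_record([(0, 0, 1), (1, 3, 0)]))  # -> [1, 90]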
Data Mining Algorithm
Associative mining is a data mining method for obtaining association rules, which can be used to explore unknown and potentially interesting relationships in the data. An association rule R can be written as R: A→B, where A and B are disjoint itemsets. Consider the following example in the LE domain: if A corresponds to having swelling at T0 and B to having tenderness at T0, then the rule R means that one can say, with a certain confidence value, that if a patient has swelling at T0, the patient will also have tenderness at T0. Before drawing any conclusion from this rule, the set {A, B} must first be a frequent itemset, meaning that the frequency of swelling and tenderness occurring together at T0 exceeds a threshold called the minimum support. The support of R is defined as:

\mathrm{support}(A \rightarrow B) = \frac{\mathrm{count}(A, B)}{N}

where count(A, B) is the frequency of swelling and tenderness co-occurring at T0, and N is the total number of records. We also need to calculate the confidence of rule R to help decide whether R is useful and meaningful:

\mathrm{confidence}(A \rightarrow B) = \frac{\mathrm{count}(A, B)}{\mathrm{count}(A)}
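As a purely hypothetical illustration (these counts are invented for exposition, not drawn from our dataset): with N = 376 records, if swelling and tenderness co-occur at T0 in 150 records, swelling appears in 200 records, and tenderness in 250 records, then

\mathrm{support}(A \rightarrow B) = \tfrac{150}{376} \approx 0.40, \quad \mathrm{confidence}(A \rightarrow B) = \tfrac{150}{200} = 0.75, \quad \mathrm{confidence}(B \rightarrow A) = \tfrac{150}{250} = 0.60.

The same frequent itemset thus yields two rules of different strength, which is why both confidence values are retained.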
If a rule’s confidence exceeds a threshold called the minimum confidence, then we can say A→B is a relevant AR. For the same frequent itemset {A, B}, a “reverse” confidence is also calculated for B→A. In our work, we perform the following data mining steps:
Step 1: Sort all the items of each record in order.
As an example, consider the questions for two symptoms (six questions) at the first visit (T0), i.e., 12 different items, i1 to i12, together with example responses from four patients. After sorting, the records appear as below:
Record 1: {i2, i3, i6}; Record 2: {i1, i4};
Record 3: {i2, i4, i8}; Record 4: {i2, i4, i7, i10}
Step 2: Build a partial-support tree10, shown in Figure 2, using the sorted records to calculate support values.
Figure 2.
A partial-support tree. Each node is an itemset including one or more items with its partial support value.
Step 3: Based on the tree, obtain the support value of all possible itemsets and compare them with the minimum support. Store only those itemsets that are frequent, together with their support values.
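The following is a simplified Python stand-in for Steps 1–3 on the four example records above. It yields the same frequent itemsets and support values as the partial-support tree, but by direct subset enumeration, which is exponential in record length; it is illustrative only, whereas the actual algorithm uses the tree of reference 10.

from collections import Counter
from itertools import combinations

# The four example records from Step 1.
records = [
    {"i2", "i3", "i6"},
    {"i1", "i4"},
    {"i2", "i4", "i8"},
    {"i2", "i4", "i7", "i10"},
]

def frequent_itemsets(records, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(records)
    counts = Counter()
    for record in records:
        items = sorted(record)                 # Step 1: keep items in order
        for size in range(1, len(items) + 1):  # enumerate every sub-itemset
            for subset in combinations(items, size):
                counts[subset] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

freq = frequent_itemsets(records, min_support=0.5)
print(freq)  # {('i2',): 0.75, ('i4',): 0.75, ('i2', 'i4'): 0.5}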
Step 4: Generate a confidence graph (see Figure 3) from the stored frequent itemsets, calculating a confidence value for every admissible pair of itemsets using the support values from the last step. Links in this graph are created under one rule: the two end-nodes may share no duplicate items.
Figure 3.
A confidence graph. Each node stores an itemset and its support value. Each link stores the support value of the itemset formed by combining items from the two end-nodes as well as the confidence value and reverse confidence value of the two symmetric ARs.
Each node in the graph represents one itemset. An edge is placed between two nodes if and only if the corresponding itemsets contain no duplicate items. Thus, each edge in the graph represents two symmetric association rules. As an illustration, consider the edge between items i2 and i3. This edge represents the two symmetric ARs, i2→i3 and i3→i2, each of which has its own confidence value.
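To illustrate Step 4, here is a minimal continuation of the sketch above; the tuple-based representation and edge format are our own simplification, with the frequent itemsets hard-coded so the block is self-contained.

# Sketch of Step 4: build the confidence graph from the frequent itemsets
# (values carried over from the previous sketch).
freq = {("i2",): 0.75, ("i4",): 0.75, ("i2", "i4"): 0.5}

def confidence_graph(freq):
    """Yield edges (A, B, support(A+B), conf(A->B), conf(B->A)).
    An edge exists only when A and B share no items and their union is frequent."""
    nodes = list(freq)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if set(a) & set(b):
                continue  # duplicate item(s) in the two end-nodes: no link
            combined = tuple(sorted(a + b))
            if combined not in freq:
                continue  # the combined itemset itself must be frequent
            s = freq[combined]
            yield a, b, s, s / freq[a], s / freq[b]

for a, b, s, conf_ab, conf_ba in confidence_graph(freq):
    print(a, "->", b, "conf=%.3f" % conf_ab, "reverse conf=%.3f" % conf_ba)
# ('i2',) -> ('i4',) conf=0.667 reverse conf=0.667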
Management of Rules in a Relational Database
In order to make searching for relevant ARs more efficient, we index the results, including all information in the confidence graph and the support values of all itemsets, in a relational database (RDB) (see Figure 4) based on the graph structure. The relational database schema is designed around the topological structure of the graph by assigning unique ids to the various levels in the graph. This design accelerates the search process, especially when the graph is large. Because even a small dataset is likely to generate hundreds of thousands of rules, we tune the database by studying the query patterns of the LE community and utilizing indices for efficient retrieval of desired ARs.
Figure 4.
The ER diagram for mapping confidence graphs into a relational database for fast retrieval of interesting ARs.
In this RDB, we have four tables. “graph_id” identifies different graphs generated with different “Minimum_Support” values; “level_id” identifies the levels of each graph, and its value is the first item in that level. In the example above, the “level_id” of the first level is i1, since all itemsets in that level start with i1. “itemset_id” identifies each itemset, and its value encodes all items in that itemset.
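As one way to realize this scheme, the following sqlite3 sketch creates the two tables most relevant to rule matching (the remaining tables are omitted). Only graph_id, level_id, and itemset_id are named in the paper; all other columns and types are our assumptions for illustration.

import sqlite3

# Hypothetical sketch of the central tables from Figure 4.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE itemset (
    itemset_id TEXT,   -- encodes all items in the itemset, e.g. 'i1,i4'
    item       TEXT,   -- one row per item of the itemset
    support    REAL
);
CREATE TABLE confidence (
    graph_id   INTEGER,  -- one graph per Minimum_Support value
    level_id   TEXT,     -- first item of the level, e.g. 'i1'
    antecedent TEXT,     -- itemset_id of A in the rule A -> B
    consequent TEXT,     -- itemset_id of B
    conf       REAL,     -- confidence of A -> B
    rev_conf   REAL      -- reverse confidence, i.e. of B -> A
);
-- Index on level_id so queries can jump straight to the relevant level.
CREATE INDEX idx_confidence_level ON confidence (level_id);
""")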
Matching Specific ARs
After generating the confidence graph and storing the results in the RDB, we can easily match meaningful ARs under different requirements. If we do not know which set of symptom attributes is relevant, we would have to examine all possible association rules, a huge number. If, instead, we have some relevant attributes for further hypothesis testing, we only want to see the rules related to those attributes. Rather than searching the whole graph for a small number of rules among a huge number, the graph structure lets us extract the sub-graph containing all relevant attributes together with the nodes they are linked to.
For example, suppose a patient has symptoms of items i1 and i4. The relevant ARs can be obtained by performing the following tasks:
Task 1: Match ARs corresponding to itemsets that contain all the relevant attributes (both i1 and i4). The system performs the following processes: (1) compare item “i1” with all “level_id” values; (2) find which level to go into; (3) find the “itemset_id” representing the itemset {i1, i4} in that level; and (4) stop and extract the sub-graph (Figure 5). This can be accomplished by executing the following relational-division SQL query.
Figure 5.
Example of extracting the sub-graph for the items of interest {i1, i4} only, with all possible ARs.
SELECT * FROM confidence WHERE level_id = 'i1'
AND antecedent IN (SELECT itemset_id FROM itemset i
WHERE NOT EXISTS (SELECT *
FROM (SELECT DISTINCT item FROM itemset) a
WHERE a.item IN ('i1', 'i4') AND NOT EXISTS
(SELECT * FROM itemset i2
WHERE i2.itemset_id = i.itemset_id
AND i2.item = a.item)));
Task 2: Match ARs corresponding to itemsets that contain a subset of the relevant attributes (i1 or i4). The system performs the following processes: (1) search “level_id” = i1 and retrieve all information under that level from the “Confidence” table; (2) search “level_id” = i4 and retrieve all information under that level; and (3) combine the two result sets and extract the sub-graph. These ARs can be retrieved with a very simple SQL query.
SELECT * FROM confidence
WHERE level_id = 'i1' OR level_id = 'i4';
Task 3: Match ARs with the relevant attributes as the consequent (rules of the form A → {i1, i4}). This task uses the same search step as Task 1, but outputs only the reverse ARs, using the reverse confidence value stored on each link.
With this setting, searching for ARs under specific conditions via “level_id” is efficient. Because the items within each itemset are sorted, and the levels themselves are also sorted, many types of searches are fast. To find any itemset together with its association rules, one can follow the flow chart shown in Figure 6 (a code sketch of this lookup follows the figure caption).
Figure 6.
The flow chart for searching specific association rules.
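Although we cannot reproduce the flow chart itself here, a minimal sketch of the lookup it describes, under the same schema assumptions as above, might look like the following; the comma-joined representation of itemset ids is our assumption.

# Sketch of the Figure 6 lookup as we read it: sort the query items, jump to
# the level keyed by the first item, then match the full itemset id there.
def match_rules(conn, items):
    """Return all stored rules whose antecedent is exactly the given itemset."""
    items = sorted(items)               # itemsets are stored sorted
    level_id = items[0]                 # levels are keyed by their first item
    itemset_id = ",".join(items)
    return conn.execute(
        "SELECT * FROM confidence WHERE level_id = ? AND antecedent = ?",
        (level_id, itemset_id),
    ).fetchall()

# Example (using conn from the schema sketch above): rules for {i1, i4}.
# rules = match_rules(conn, ["i4", "i1"])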
As mentioned previously, most associative mining algorithms stop at finding frequent itemsets, so extra effort is required to derive ARs from the frequent itemsets. As the size of the itemsets increases, the number of ARs grows rapidly; for very large frequent itemsets, finding ARs becomes a combinatorial problem and very time-consuming. Besides writing code to generate ARs, one could instead ask a domain expert to find the relevant ARs. Realistically, though, an expert will not go through thousands of generated ARs; some ARs would be selected essentially at random, meaning that relevant ARs could be missed. This risk of missing or overlooking relevant ARs is the reason to use the proposed algorithm, which extracts information more efficiently and will not miss any potentially relevant ARs.
Results and Discussion
To demonstrate the efficiency gained by our approach, we first store all ARs in two different RDBs: one uses our graph structure; the other does not. To make a fair comparison, the two RDBs have the same set of attributes, except that the graph-based one has one extra column, “level_id”, representing a topological property of the graph structure. We perform the same query, shown above, on both RDBs and use “SHOW PROFILE BLOCK IO” to obtain the counts of block input and output operations; this measurement reflects the cost of rule matching. We then apply one-way ANOVA to test the null hypothesis H0: μ1 = μ2 for the average I/O counts of the two structures. The null hypothesis is rejected with F(1,748) = 181.5 and p < .0001. Moreover, with data from 376 patients, the graph-structure group (μ1 = 161,402) incurs less than one tenth of the block I/O of the other group (μ2 = 1,833,293). The potential scaling issues involve the number of patients and the number of survey items/options. The former normally does not cause a dramatic change in the number of rules; the latter, however, could cause a combinatorial explosion as it grows.
This graph structure allows us to perform AR matching for different patient conditions. For example, in the BP document, swelling is described as an early sign of lymphedema that might be accompanied by heaviness, tightness, stiffness, and aching. The following results were obtained using the procedure described in Figure 6 for patients with certain combinations of symptoms (minimum support = 0.1).
Task 1: Match relevant ARs containing {“having swelling now at T1”, “having tenderness now at T1”} together:
R1: “Having swelling AND tenderness now at T1” → “Having firmness/tightness now at T1” (Confidence = 0.802);
R2: “Having swelling AND tenderness now at T1” → “Having numbness now at T1” (Confidence = 0.787)
R3: “Having swelling AND tenderness now at T1” → “Having numbness now at T2” (Confidence = 0.787)
R4: “Having swelling AND tenderness now at T1” → “Having numbness now at T3” (Confidence = 0.787)
R1 provides evidence-based support for the BP document’s claim that “Swelling always goes along with firmness/tightness”, while R2, R3, and R4 provide additional information: swelling and tenderness go along not only with tightness but also with numbness. Moreover, if a patient has swelling and tenderness at T1, there is a more than 78% chance that the numbness will persist through the T3 visit.
We also study relevant ARs with {“Having firmness/tightness now at T1”, “Having tenderness now at T1”, “Having numbness now at T1”} together:
R5: “Having firmness/tightness AND tenderness AND numbness now at T1” → “Having tenderness now at T2” (Confidence=0.663)
R6: “Having firmness/tightness AND tenderness AND numbness now at T1” → “Having tenderness now at T3” (Confidence=0.631)
R7: “Having firmness/tightness AND tenderness AND numbness now at T1” → “Having tenderness now at T4” (Confidence=0.563)
From R5–R7, if a patient has firmness/tightness, tenderness, and numbness at T1, there is a more than 50% chance that the patient will continue to have tenderness through the T4 visit.
Task 2: Match all ARs related to {“having swelling now at T1” or “having tenderness now at T1”}. The results include all the ARs from Task 1 as well as other ARs that contain only one of these items as a subset.
R8: “Having tenderness now at T1” → “Having tenderness now at T2” (Confidence=0.561)
R9: “Having tenderness now at T1 and T2” → “Having tenderness now at T3” (Confidence=0.703)
From R8 and R9, we can provide evidence supporting the BP document by noting that having tenderness at T1 corresponds to an increased chance of still having tenderness at future visits. This is why tenderness is considered an early sign of developing lymphedema.
Task 3: Match all ARs with {“having swelling now at T1”, “having tenderness now at T1”} in the consequent. The ARs are the same as in Task 1, but with only the reverse confidence values.
We also divide the patients’ data into four groups based on BMI: underweight (<18.5), normal weight (18.5–24.9), overweight (24.9–29.9), and obesity (≥29.9). Since only 12 patients in our data are underweight, we exclude them from the analysis. The results are below:
Normal weight: R10: “Having heaviness at T1” → “Having tenderness at T2” (Confidence=0.57)
Overweight and Obesity: R11: “Having heaviness at T1” → “Having tenderness at T2” (Confidence=0.671).
From R10 and R11 we can say that, when a patient has heaviness at T1, the higher the BMI, the greater the chance the patient will have tenderness later. This matches the BP document’s recommendation to “maintain healthy weight”.
Also, in our overweight and obesity groups, around 30% of patients have redness at T1, and we find the following AR in those two groups:
R12: “Having redness at T1” → “Having redness at T2” (Confidence = 0.4), which means those patients have a 40% chance of still having redness at T2.
In the following findings, we study the relevant ARs associated with stiffness.
Normal weight: R13: “Having stiffness at T1” → “Having numbness at T2” (Confidence=0.758)
R14: “Having stiffness at T1 and numbness at T2” → “Having numbness at T3” (Confidence=0.865)
Overweight and Obesity: R15: “Having stiffness at T1” → “Having numbness at T2” (Confidence=0.65)
From R13–R15, we can say that, if a patient has stiffness at T1, normal-weight patients are more likely to have numbness at later visits than those in the overweight and obesity categories.
By using this sorted graph-structure associative mining algorithm, the entire set of mining results can be stored in a graph structure, which facilitates more efficient matching of ARs even when the candidate AR pool is quite large. The approach provides useful information for clinicians in identifying patients who may be candidates for interventions to reduce the risk of developing LE, and for early detection leading to more optimal treatment outcomes.
By using the discovered ARs, evidence-based knowledge for LE treatment and management can be shared by the international community; further studies can then be conducted across continents by linking partners’ data together, and the findings used to refine the BP document.
Although our data come from the LBCQ, our approach is also applicable to other diseases, particularly chronic diseases with repeated measurements over time.
Acknowledgments
This project is currently supported by the ALFP and by National Institute for Nursing Research grants R01 NR05342-01 and R01 NR010293-01A2. The authors would like to thank Drs. Jane M. Armer and Bob R. Stewart, as well as the R01 and ALFP staff, for data collection and many in-depth discussions.
References
1. American Cancer Society. Cancer Facts & Figures 2008.
2. Reis L, et al. SEER Cancer Statistics Review. Bethesda, MD: National Cancer Institute; 2005.
3. Armer JM, Stewart BR. A comparison of four diagnostic criteria for lymphedema in a post-breast cancer population. Lymphatic Research and Biology. 2005;3(4):208–217. doi: 10.1089/lrb.2005.3.208.
4. Badger C, Preston N, Seers K, Mortimer P. Physical therapies for reducing and controlling lymphedema of the limbs. Cochrane Database Syst Rev. 2004;4:CD003141. doi: 10.1002/14651858.CD003141.pub2.
5. Lymphoedema Framework. Best Practice for the Management of Lymphoedema: International Consensus. London: MEP Ltd; 2006.
6. Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery. 2004;8:53–87.
7. Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB); 1994. p. 487–99.
8. Armer JM, Whitman M. The problem of lymphedema following breast cancer treatment: prevalence, symptoms, and management. Lymphology. 2002;35:153–159.
9. Mahamaneerat WK, Shyu CR, Stewart BR, Armer JM. Breast cancer treatment and post-op swelling, lymphedema. Journal of Lymphoedema. 2008;3(2).
10. Goulbourne G, Coenen F, Leng P. Algorithms for computing association rules using a partial-support tree. Knowledge-Based Systems. 2000;13:141–149.