SymptomGraph: Identifying Symptom Clusters from Narrative Clinical Notes using Graph Clustering

Fattah Muhammad Tahabi; Susan Storey; Xiao Luo

doi:10.1145/3555776.3577685

Author manuscript; available in PMC: 2023 Sep 16.

Published in final edited form as: Proc Symp Appl Comput. 2023 Jun 7;2023:518–527. doi: 10.1145/3555776.3577685


PMCID: PMC10504685 NIHMSID: NIHMS1930528 PMID: 37720922

Abstract

Patients with cancer or other chronic diseases often experience different symptoms before or after treatment. The symptoms can be physical, gastrointestinal, psychological, cognitive (e.g., memory loss), or of other types. Previous research has focused on understanding individual symptoms or symptom correlations by collecting data through symptom surveys and analyzing them with traditional statistical methods, such as principal component analysis or factor analysis. This research proposes a computational system, SymptomGraph, to identify symptom clusters in the narrative text of clinical notes in electronic health records (EHR). SymptomGraph uses a set of natural language processing (NLP) and artificial intelligence (AI) methods to first extract the clinician-documented symptoms from clinical notes. Then, a semantic symptom expression clustering method is used to discover a set of typical symptoms. A symptom graph is built based on the co-occurrences of the symptoms. Finally, a graph clustering algorithm is developed to discover the symptom clusters. Although SymptomGraph is applied to narrative clinical notes, it can be adapted to analyze symptom survey data. We applied SymptomGraph to a data set of colorectal cancer patients with and without Type 2 diabetes to detect the patient symptom clusters one year after chemotherapy. Our results show that SymptomGraph can identify the typical post-chemotherapy symptom clusters of colorectal cancer patients. The results also show that colorectal cancer patients with diabetes often show more symptoms of peripheral neuropathy, younger patients have mental dysfunctions of alcohol or tobacco abuse, and patients at later cancer stages show more memory loss symptoms. Our system can be generalized to extract and analyze symptom clusters of other chronic diseases or acute diseases like COVID-19.

Keywords: Symptom Clusters, Graph Neural Networks, Clinical Notes, Graph Clustering, Electronic Health Records

1. INTRODUCTION

Electronic Health Records (EHRs) have become a standard tool in healthcare delivery. Although the original objective of EHRs was to store patient records, the ever-growing clinical data in EHRs enable artificial intelligence (AI)-based clinical data mining and decision support applications. The heterogeneous EHR data are available in both structured fields and unstructured clinical notes. Diagnoses, medications, and lab tests are often recorded in the structured fields of the EHR, whereas symptoms and clinical evidence are often written in the clinical notes. Because so much clinical information is written in the clinical notes, analyzing them is valuable for understanding clinical information and its associations, supporting clinical research, and building clinical decision support systems.

In nursing science, symptom clusters are defined as two or more symptoms that are related and occur together [25]. Understanding symptoms and discovering symptom clusters of chronic diseases, such as cancer, are important for nursing science to enhance scientific knowledge, facilitate communication, and improve patient outcomes. Because of the complexity of mining clinical notes, typical symptom research analyzes symptoms through patient surveys [5, 31, 43]. The patient survey approach has limitations: it can be costly and challenging to collect a large amount of data from a large population, the surveys might not capture the various symptoms that patients experience in real time, and completing them may place an additional burden on patients. In EHR clinical notes, clinicians document clinical diagnoses, medications, and symptoms during a clinical encounter. These symptoms are documented in real clinical scenarios. If these clinician-documented symptoms can be extracted and further analyzed for nursing and clinical research, it will greatly enhance the understanding of symptom prevalence in different diseases and patient populations. Clinicians are also interested in gaining insight into symptoms and clusters of symptoms to inform the development of symptom management strategies. Extracting clinical signs and symptoms from patient records and analyzing them further would benefit the study of many other diseases as well. With the advances in natural language processing (NLP) and AI, a systematic review [27] found that a number of studies applied NLP techniques to clinical notes in the EHR for symptom analysis. However, most of the existing research on symptom cluster extraction uses either statistical methods, such as factor analysis [23][47][11] and PCA [23][15][11], or traditional clustering methods, such as hierarchical clustering, after modeling the occurrence of the symptoms as binary vectors [50][35]. Some other research investigated the semantic clustering of symptoms [21] while ignoring symptom co-occurrence, which is the core of the clinical definition of symptom clusters. One study applied graph models to identify symptoms relevant to a given symptom (symptom expansion) [41]. Our research is the first to investigate using graph neural networks to discover symptom clusters based on clinical notes.

Expanding upon the methodologies in the literature and our previous research, we examined the following three questions in this research: 1) How to model symptom correlations using graph models?; 2) How to discover the symptom clusters using graph clustering?; 3) How to analyze and evaluate the discovered symptom clusters? To the best of our knowledge, no previous work has applied unsupervised graph clustering on clinical notes to answer these questions.

This study presents a symptom mining system, SymptomGraph, that generates a symptom graph based on clinical notes to discover the symptom clusters and the clinical parameters associated with them. A symptom graph is a graph with extracted symptoms as nodes and the co-occurrences of symptoms as edges. The graph learning algorithm Node2Vec is trained to provide a general representation of symptoms to support symptom clustering. We evaluated the symptom clusters discovered by SymptomGraph against the clinical parameters and through a literature search in the clinical domain.

The main contributions of our paper are:

Develop a graph model based on the co-occurrences of symptoms.
Apply graph clustering to discover the symptom clusters.
Evaluate the symptom clusters based on medical knowledge reflected by literature.
Demonstrate that our model can be applied to narrative text to discover the symptom clusters of other chronic or acute diseases.

2. RELATED WORK

2.1. Symptoms and Symptom Clusters Extraction

Clinicians use symptoms or symptom clusters to analyze the etiology of many diseases. However, identifying symptoms or symptom clusters from text or other data types is often challenging and requires interdisciplinary domain knowledge. Researchers have adopted qualitative and quantitative methods to detect symptoms or symptom clusters from various data sources. Kim et al. [24] described various statistical methods to identify and quantify symptom clusters, including canonical correlation, partial correlation, and structural equation modeling. Sondhi et al. [41] introduced SympGraph, a framework that expands a given set of symptoms to other related symptoms by analyzing the underlying graph structure built from the co-occurrences of the symptoms. Ni et al. [33] built a symptom-disease bipartite network that can identify and rank the disease and symptom clusters simultaneously, applying domain network clustering and cross-network cluster ranking. Linder et al. [29] used a symptom assessment application in which adolescents themselves indicate temporal relationships among symptoms, from which the researchers can then identify groups of symptoms. Barsevick [4] described different ways of identifying symptom clusters, such as through expert opinion, group comparisons, evidence of shared variance, and identification of subgroups. Gu et al. [21] developed SymptomID to discover COVID-19 symptoms from news reports. SymptomID used several transformer-based models, such as BERT [14], GPT [36], and XLNet [51], for symptom extraction. Afterward, DBSCAN was used as a clustering method to group the symptom expressions semantically and extract the main symptom expressions. Papachristou et al. [35] applied five different clustering algorithms (K-modes, BIRCH, spectral clustering, agglomerative hierarchical clustering, and k-means) and compared them with the latent class analysis method for symptom clustering using patient registry records. Neijenhuijs et al. [32] performed cluster analysis with HDBSCAN [6], an extension of the DBSCAN [17] clustering algorithm, and showed in their recent study that symptom clusters could guide specific interventions among cancer survivors. Chow et al. [11] applied three statistical approaches, exploratory factor analysis, principal component analysis (PCA), and hierarchical cluster analysis, to detect symptom clusters in non-metastatic breast cancer patients treated by radiation therapy (RT).

2.2. Graph Models and Their Application to Clinical Data

Graph clustering groups nodes such that similar nodes stay together and nodes with dissimilar properties stay apart. It has been extensively used for community detection problems [18]. In 2016, Grover et al. [20] published the Node2Vec method, which learns continuous feature representations for nodes in networks by mapping nodes to a low-dimensional feature space that maximizes the likelihood of preserving the network neighborhoods of nodes. Since then, the Node2Vec model has been applied in different domains, including the clinical domain, for node prediction, edge prediction, and other tasks. Shen et al. [40] leveraged Node2Vec to generate node embeddings for the Human Phenotype Ontology (HPO) to assist phenotypic similarity measurement. Their results show that applying the HPO embeddings to link prediction achieved 0.81 ROC AUC and 0.75 F-measure. Du et al. [16] built a linked graph based on information extracted from registered COVID-19 clinical trials, in which each clinical trial is a node. Node2Vec was then applied to learn the embeddings of the clinical trials, and the results show that the learned embeddings can assist clinical trial search and visualization. Kim et al. [26] applied a graph model to understand the complex and diverse mechanisms of biological pathways. The graph was built by extracting genes as nodes and identifying co-occurrences as edges, and Node2Vec was used to generate the embeddings of the genes. The results showed that the graph model could extract the relationships between genes and identify the gene-gene interactions involved in a type 2 diabetes pathway. Chen et al. [10] investigated drug-target interactions (DTIs) using graph models to understand the mechanism of drug action. They built a molecular association network and applied Node2Vec to generate the embeddings. Finally, they trained a random forest (RF) classifier to predict potential drug-target pairs. Their results showed that the generated embeddings helped to obtain 87.37% accuracy. Lee et al. [28] investigated five embedding generation methods, including Node2Vec, to learn medical concept embeddings based on the medical concepts defined by the Observational Health Data Sciences and Informatics (OHDSI) Observational Medical Outcomes Partnership common data model (OMOP CDM). Their results showed that the embeddings learned by Node2Vec performed better than those of the other models when selected parameters were used. Oniani et al. [34] also used Node2Vec to analyze a COVID-19 data set. They built the graph based on the co-occurrences among chemicals, diseases, genes, and mutations using the linked data set CORD-19-on-FHIR. DBSCAN was used to identify 63 clusters using silhouette values, and five coronavirus infectious diseases were detected in their corresponding subgroups.

To the best of our knowledge, our research is the first to build a symptom graph based on symptom co-occurrences and apply graph clustering to discover symptom clusters using narrative EHR clinical notes.

3. METHODOLOGY

The system to identify symptom clusters from EHR clinical notes has four main components, as shown in Figure 1. The first component extracts symptom expressions from the raw clinical notes. Since the same symptom might be expressed in different terms, the second component groups the symptom expressions based on their semantic meanings. Each group is then represented by a typical symptom of the group. Based on the identified typical symptoms, the co-occurrences of the symptoms are used to build a symptom graph. Each symptom is a node in the graph, whereas an edge between two nodes represents the co-occurrences of the two symptoms within a clinical note. After the symptom graph is constructed, Node2Vec, a state-of-the-art feature learning method for network analysis, is applied to generate symptom embeddings for symptom representation. Finally, clustering algorithms are applied to the symptom embeddings to generate symptom clusters.

3.1. Symptom Extraction

We utilized UMLS MetaMap [2] to extract the symptoms from the narrative clinical notes. UMLS MetaMap is an NLP tool that maps phrases in text to 127 semantic types using various language sources. This research used two semantic types, ‘Sign or Symptom’ and ‘Mental and Behavioral Dysfunction’, to identify symptoms from clinical notes. Figure 2 provides an example of clinical notes with highlighted terms mapped into these two semantic types using UMLS MetaMap. UMLS MetaMap includes functions for negation detection and Word Sense Disambiguation (WSD). Therefore, it can exclude most of the negated symptoms. However, because some parts of the clinical notes were not written as sentences, some negations cannot be detected by UMLS MetaMap. We developed a pattern-based approach to remove the negation cases that UMLS MetaMap cannot detect.

Figure 2: Symptom Phrases Detection using UMLS MetaMap

3.2. Semantic Grouping

After extracting symptoms from the clinical notes using UMLS MetaMap, we found that many symptoms have similar semantic meanings but are written in different expressions. For example, ‘fatigue’, ‘exhaustion’, ‘general fatigue’, ‘generalized fatigue’, ‘lethargic’, ‘lethargy’ and ‘tiredness’ indicate the same symptom of the patient, ‘fatigue’. These expressions need to be grouped together so that they can be differentiated from other typical symptoms. To group the extracted symptom expressions with similar semantic meaning, we first used the universal sentence encoder [9] to convert symptom expressions that include one or more words into embedding representations. Our previous research [30] demonstrated that the universal sentence encoder performs well in identifying similar semantic expressions in clinical text narratives. We then used a hierarchical clustering algorithm with the cosine distance measure to cluster the symptom expressions into different typical symptom expressions, such as ‘anxious’, ‘numbness’, ‘fatigue’, etc. The most frequent symptom expression in each cluster represents the group. Through this process, we reduced the various symptom expressions to a set of typical symptom expressions.
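The grouping step can be sketched roughly as follows. This is only an illustration under stated assumptions: the 2-D vectors are hypothetical stand-ins for universal sentence encoder embeddings, the frequencies are invented, and the average linkage and the distance cutoff `max_distance` are choices made for the sketch, since the text specifies only hierarchical clustering with cosine distance.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def group_expressions(embeddings, max_distance):
    """Average-linkage agglomerative clustering over expression embeddings.

    Merging stops once the two closest clusters are farther apart than
    `max_distance` (a hypothetical cutoff, not a value from the paper).
    """
    clusters = [[name] for name in embeddings]

    def linkage(c1, c2):
        total = sum(cosine_distance(embeddings[a], embeddings[b])
                    for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda item: item[1],
        )
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def representative(cluster, frequency):
    """The most frequent expression represents its group."""
    return max(cluster, key=lambda name: frequency[name])

# Hypothetical toy embeddings and frequencies, for illustration only.
embeddings = {
    "fatigue": (1.0, 0.1), "lethargy": (0.9, 0.2), "tiredness": (0.95, 0.05),
    "numbness": (0.1, 1.0), "hand numbness": (0.2, 0.9),
}
frequency = {"fatigue": 50, "lethargy": 5, "tiredness": 8,
             "numbness": 30, "hand numbness": 4}

groups = group_expressions(embeddings, max_distance=0.1)
typical = sorted(representative(g, frequency) for g in groups)
```

With these toy inputs the expressions fall into two groups, represented by ‘fatigue’ and ‘numbness’.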

3.3. Symptom Graph Construction

This research aims to identify the symptoms that are correlated with each other during the progression of diseases or after treatments. Instead of applying typical principal component analysis or factor analysis, we investigated a graph model to understand the relations between symptoms. We constructed a symptom graph based on the symptom expressions extracted from each clinical note. Every symptom expression extracted from the clinical notes is converted to the corresponding typical symptom expression by referring to the symptom semantic grouping. For example, if ‘lethargic’ is described in a clinical note, ‘fatigue’ is used to represent it. It is worth noting that if ‘fatigue’ is mentioned multiple times in a clinical note, we count it only once, because we want to focus on the co-occurrences of different symptoms.

To construct a symptom graph, let G = (V, E) be an undirected graph, where V is the set of vertices representing the typical symptom expressions and E is the set of edges representing the symptoms that occur together in a clinical note. The symptom graph is built by processing all clinical notes in our data set. Specifically, we generated a weighted graph G, in which the weight on each edge indicates the number of co-occurrences between the two symptoms. To limit the impact of rare co-occurrences, we removed the edges that have a weight less than a predefined threshold θ; that is, if two symptoms occur together fewer than θ times in the clinical notes, the co-occurrence is not considered in this research. Table 1 shows snippets of five clinical notes with the extracted symptoms. Figure 3 shows the symptom graph built based on these clinical notes.
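A minimal sketch of this construction, using the symptom lists of the snippets printed in Table 1 as input. The function name `build_symptom_graph` and the value θ = 2 used for this tiny example are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def build_symptom_graph(notes, theta):
    """Return {(symptom_a, symptom_b): weight} for the weighted symptom graph.

    `notes` holds one list of typical symptoms per clinical note. Each
    symptom counts once per note, and edges with weight < theta are dropped.
    """
    weights = Counter()
    for symptoms in notes:
        unique = sorted(set(symptoms))          # count a symptom once per note
        for a, b in combinations(unique, 2):    # every unordered pair in the note
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= theta}

# Typical symptoms of the snippets printed in Table 1.
notes = [
    ["fatigue", "nausea", "numbness", "tingle"],
    ["depression", "nausea", "numbness", "tingle", "vomit", "weakness"],
    ["constipation", "nausea", "insomnia", "weakness"],
    ["tobacco abuse", "vomit"],
]

graph = build_symptom_graph(notes, theta=2)
```

With θ = 2, only the pairs that co-occur in at least two of these notes remain: nausea–numbness, nausea–tingle, numbness–tingle, and nausea–weakness, each with weight 2.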

Table 1: Snippets of Clinical Notes and Extracted Symptoms

Clinical Report Snippets	Symptoms
…. The patient has been having a lot of *tingling, numbness* with discomfort in hands and feet after each chemotherapy dose …….Review was positive for persistent *fatigue…..* Recently he has had a feeling of continuous *nausea*…	fatigue,nausea,numbness,tingle
… her legs are just *weak*…. She does have some *numbness* and *tingling* of her fingers…*Depression*…OXYCODONE causes *nausea* and *vomiting*.	depression,nausea,numbness,tingle,vomit,weakness
…Difficulty walking and *weakness*/diffuse atrophy…and *constipation*…as needed for *nausea*…as needed for *sleeplessness*…	constipation,nausea,insomnia,weakness
…He had complaints of *vomiting* after he had eaten a breakfast about 3 hours ago….Tobacco abuse…	tobacco abuse,vomit

Figure 3: Graph Built based on Clinical Notes in Table 1

3.4. Symptom Graph Clustering

3.4.1. Symptom Embedding Generation.

Graph Neural Networks (GNNs) [52] have been applied to many graph-based applications for node classification or link prediction in the clinical domain [19, 22]. However, unsupervised GNNs have not been extensively explored for clinical note analysis, although unsupervised graph clustering can be more resistant to advances in GNNs. Node2Vec [20] is one of the state-of-the-art unsupervised GNNs with an identity feature matrix; it can be trained to generate node embeddings, meaning that each node learns its own vector representation. This research adopted a weighted Node2Vec model to generate the symptom embeddings based on the symptom graph.

Node2Vec uses a biased random walk procedure to explore neighborhoods in a Breadth-first Sampling (BFS) and Depth-first Sampling (DFS) fashion. BFS explores the immediate neighbors of the source node, whereas DFS gives information about the nodes located at a larger distance from the source node, providing a global view of the graph structure. Node2Vec interpolates between BFS and DFS by forming a random walk heuristic of fixed length. Given a graph G = (V, E), the walk from node c to y is defined by the probability function P in Equation 1, where $\frac{\mu_{cy}}{N}$ is the transition probability between nodes c and y, e_cy is the edge in G that connects nodes c and y, and N is the normalizing constant.

$$P(y \mid c) = \begin{cases} \frac{\mu_{cy}}{N} & \text{if } e_{cy} \in G \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

For a weighted graph, the random walk uses the weights on the edges to compute the transition probabilities between two nodes. As shown in Figure 4, if the random walk has just traversed edge (n, c) from node n to c, the unnormalized transition probability μ_cy from c to each of its neighbors y is calculated as in Equation 2, where β_pq (n, y) is determined by the distance d_ny between n and the neighbor y of c, and w_cy is the weight on the edge between y and c.

$$\mu_{cy} = \beta_{pq}(n, y) \cdot w_{cy}, \quad y \in \{d, e, f, n\} \tag{2}$$

There are three types of neighbors in the calculation. Based on the graph shown in Figure 4, a different value of β_pq is assigned to each type, as shown in Equation 3. Both p and q control how fast the walk explores and leaves the neighborhood of the starting node: p sets the likelihood of immediately revisiting a node in the walk, whereas q allows the search to distinguish between inward and outward nodes. The edge weights further scale the likelihood of stepping to each candidate next node.

$$\beta_{pq}(n, y) = \begin{cases} \frac{1}{p} & \text{if } d_{ny} = 0, \text{ when } y = n \\ 1 & \text{if } d_{ny} = 1, \text{ when } y = e \\ \frac{1}{q} & \text{if } d_{ny} = 2, \text{ when } y = d \text{ or } f \end{cases} \tag{3}$$
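Equations 1 to 3 can be illustrated with a small sketch that computes the normalized transition probabilities for one step of the weighted walk. The toy graph mirrors the node names n, c, d, e, f used above, but its edge weights are hypothetical. The values p = 0.5 and q = 2 are one possible reading of Table 3, which reports 1/p = 2 and 1/q = 0.5; that reading is an assumption of this sketch.

```python
def transition_probabilities(graph, prev, cur, p, q):
    """Probabilities of stepping from `cur` to each neighbor y, given that
    the walk arrived at `cur` from `prev` (Equations 1-3).

    `graph` maps node -> {neighbor: edge weight} for an undirected graph.
    """
    unnormalized = {}
    for y, w_cy in graph[cur].items():
        if y == prev:              # d_ny = 0: step back to the previous node
            beta = 1.0 / p
        elif y in graph[prev]:     # d_ny = 1: y is also a neighbor of prev
            beta = 1.0
        else:                      # d_ny = 2: y leads away from prev
            beta = 1.0 / q
        unnormalized[y] = beta * w_cy          # Equation 2
    total = sum(unnormalized.values())         # normalizing constant N
    return {y: mu / total for y, mu in unnormalized.items()}

# Hypothetical weights on a graph shaped like the n, c, d, e, f example.
toy_graph = {
    "n": {"c": 2, "e": 1},
    "c": {"n": 2, "d": 1, "e": 1, "f": 4},
    "d": {"c": 1},
    "e": {"c": 1, "n": 1},
    "f": {"c": 4},
}

probs = transition_probabilities(toy_graph, prev="n", cur="c", p=0.5, q=2.0)
```

Here the unnormalized values are 4 for n, 1 for e, 0.5 for d, and 2 for f, so N = 7.5 and the walk returns to n with probability 4/7.5.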

After sampling the maximum number of random walks with maximum length L, the random walks are used to generate node embeddings. The objective of the embedding generation process is to maximize the logarithmic probability of observing a network neighborhood M_s (u) for a node u based on its feature representation. Equation 4 is the objective function of the Node2Vec training process, where P (M_s (u)|f (u)) is the probability of observing the neighborhood of node u given its embedding f (u). The skip-gram negative sampling model is used to build pairs of input and context nodes with a specified window size to feed into the training of the graph neural network.

$$\max_{f} \sum_{u \in V} \log P(M_{s}(u) \mid f(u)) \tag{4}$$
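To make the pair-building step concrete, the sketch below turns one random walk into (input, context) node pairs for a given window size. It covers only the pair generation; negative sampling and the embedding training itself are omitted, and the function name and example walk are illustrative assumptions.

```python
def skipgram_pairs(walk, window):
    """Return (center, context) pairs for every node in `walk` and each
    node at most `window` positions away from it."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

# A hypothetical random walk over symptom nodes.
walk = ["nausea", "vomit", "fatigue", "constipation"]
pairs = skipgram_pairs(walk, window=1)
```

With a window of 1, each interior node contributes two pairs and each end node contributes one, giving six pairs for this four-node walk.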

3.4.2. Symptom Clustering.

After Node2Vec is used to project the symptom nodes into the vector space, different clustering algorithms are applied, including K-Means, Hierarchical Agglomerative Clustering, Gaussian Mixture Models, and DBSCAN.

K-Means:

In the K-Means clustering algorithm data points are assigned to exactly one cluster of the predefined K number of clusters. For these K clusters there are K centroids and those centroids are updated in such a way that the sum of the squared distance between the data points in a cluster and the corresponding centroid of that cluster is minimum.
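As a rough illustration of the centroid update described above, here is a minimal K-Means (Lloyd's algorithm) sketch. Initializing the centroids with the first K points is a simplification made to keep the example deterministic, and the 2-D points are hypothetical stand-ins for symptom embeddings.

```python
def kmeans(points, k, iterations=100):
    """Assign each point to exactly one of k clusters and return
    (assignment, centroids)."""
    centroids = [list(pt) for pt in points[:k]]   # simplistic deterministic init
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:          # converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = kmeans(points, k=2)
```

On these four points the procedure settles on two clusters with centroids (0, 0.5) and (10, 10.5).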

Hierarchical Clustering (HC):

The HC algorithm hierarchically generates bigger clusters from single data points using the agglomerative approach. In each step, the two closest data points or clusters are merged into a single cluster according to some distance criterion. The type of distance being considered gives rise to different popular hierarchical clustering methods, such as ward, single, average, weighted, and so on.

Gaussian Mixture Models (GMMs):

Gaussian Mixture Models (GMMs) presume that there are a definite number of Gaussian distributions, each of which represents a cluster. A GMM tends to group the data points belonging to a single distribution together. GMMs are probabilistic models that employ a soft clustering approach for distributing the points among clusters. For a data set of K clusters, there is a mixture of K Gaussian distributions, each having a specific mean and variance. The Expectation-Maximization (EM) technique is used to determine these values.

DBSCAN:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are close to each other based on a distance measurement and a minimum number of points. It can detect outliers, which are points located in low-density regions. For a fixed minimum number of points, changing the distance value generates different numbers of clusters.

To evaluate the clustering methods and identify the optimum number of clusters for the symptom graph, the Dunn Index (DI) is calculated for different number of clusters. DI is the ratio of the smallest inter-cluster distance to the largest intra-cluster distance. Given k clusters, the Dunn Index is calculated as:

$$DI_{k} = \frac{\min_{1 \leq i < j \leq k} \delta(S_{i}, S_{j})}{\max_{1 \leq n \leq k} \Delta_{n}} \tag{5}$$

where δ(S_i, S_j) is the inter-cluster distance between clusters S_i and S_j, and Δ_n is the intra-cluster distance of the n-th cluster; both can be calculated using any of the single, complete, average, or centroid linkage distance functions. A better clustering keeps the clusters far from each other and the elements within each cluster close together; hence, a higher DI indicates a more satisfactory clustering result.
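A minimal sketch of the Dunn Index follows. It assumes Euclidean distance, takes the inter-cluster distance as the smallest distance between points of two different clusters (single linkage), and takes the intra-cluster distance as the cluster diameter (complete linkage). These are only one of the combinations allowed above, and the example points are hypothetical.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn Index for a list of clusters, each a list of point tuples."""
    # Smallest distance between points that belong to different clusters.
    min_inter = min(
        math.dist(a, b)
        for s_i, s_j in combinations(clusters, 2)
        for a in s_i
        for b in s_j
    )
    # Largest distance between two points of the same cluster (diameter).
    max_intra = max(
        (math.dist(a, b) for s in clusters for a, b in combinations(s, 2)),
        default=0.0,
    )
    if max_intra == 0.0:          # every cluster is a single point
        return math.inf
    return min_inter / max_intra

compact = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (3.0, 1.0)]]
stretched = [[(0.0, 0.0), (3.0, 0.0)], [(0.0, 1.0), (3.0, 1.0)]]

di_compact = dunn_index(compact)
di_stretched = dunn_index(stretched)
```

The same four points give a DI of 3 when grouped into two tight, well-separated clusters and a DI of 1/3 when grouped the other way, which matches the intuition that a higher DI indicates a better partition.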

4. EXPERIMENTAL RESULTS

4.1. Data Set

The patient cohort in this research consists of patients with a primary diagnosis of colorectal cancer, with and without diabetes, who have records in a large academic medical center’s EHR system. The colorectal cancer patients were identified using International Classification of Diseases (ICD) codes: 153–154 (ICD-9) and C18–C20 (ICD-10). This research focused on investigating the symptom clusters one year after the patients received their first chemotherapy. Through these ICD codes, we identified colorectal cancer patients who received chemotherapy within the ten years of 2007–2017. Among these patients, 996 have clinical notes within one year after the chemotherapy treatment. For each patient, we extracted clinical notes to generate a symptom graph for symptom clustering. In total, 1675 clinical notes covering over 119 types of clinical notes were included. Figure 5 shows the top 10 major types of clinical notes; discharge summaries and progress notes are the major types. Figure 6 shows the gender distribution, the percentage of patients with diabetes, and the percentage distribution of the patients by cancer stage. There are slightly more male patients in our data set. About 28.2% of patients have diabetes, and 1.6%, 16.8%, and 51.4% of patients are in cancer stage 0, II, and III, respectively. The pathologic stage of the cancer is unknown for 30.2% of patients.

Figure 5: Top 10 Frequent Types of Clinical Notes

Figure 6: Distribution of the Clinical Parameters

Other than the clinical notes, we included gender, diabetes status, the pathologic stage of cancer, and age at diagnosis as clinical parameters to understand the relationships between those data points and symptom clusters.

4.2. Extracted Symptoms and Constructed Symptom Graph

Following symptom extraction, 106 unique positive symptom expressions were extracted from the data set. After applying the semantic grouping, we grouped them into 19 typical symptoms. Table 2 shows the typical symptoms and the grouped symptom expression along with their document frequency (DF - the number of clinical notes that have these symptom expressions). The symptom extraction results show that ‘nausea’, ‘vomit’, and ‘fatigue’ are typical symptoms reported in more than 700 clinical notes within one year after the chemotherapy. The least reported symptom within the same timeframe was ‘memory loss’.

Table 2: Extracted Symptoms from Clinical Notes

Symptoms	DF	Symptom Expressions
abdominal discomfort	86	abdominal discomfort, abdominal cramping, abdominal tenderness, abdominal cramp, abdominal symptom, abdominal, abdominal fullness
abdominal bloating	39	abdominal bloating, bloating, bloating symptom
insomnia	88	insomnia, sleeplessness
vomit	751	vomit, persistent vomit, intractable vomit
shake	35	shake, shake chill
anxious	90	anxious, nervousness, nervous, panic attack, anxious behavior
weakness	349	weakness, weak, generalize weakness, feel weakness, general weakness, muscle weakness, generalize muscle weakness
memory loss	15	memory loss, memory problem, forgetful, forgetfulness
poor appetite	29	poor appetite, diminish appetite, reduce appetite
fatigue	731	fatigue, tire, burning, lethargic, lethargy, tiredness, generalize fatigue, exhaust, exhaustion, combat fatigue, extreme fatigue, general fatigue
alcohol abuse	32	alcohol abuse, acute alcoholic intoxication, alcoholism, etoh abuse
tobacco abuse	52	tobacco abuse, abuse, marijuana abuse, substance abuse, drug abuse, medication abuse
constipation	265	constipation, chronic constipation
depression	320	depression, depressed, depress mood, mental depression, depression and anxiety, anxiety and depression, major depressive disorder, mixed anxiety and depressive disorder, chronic depression, depressive disorder, reactive depression, depression symptom, depressive symptom, minimal depression, single episode of major depressive disorder
numbness	198	numbness, leg numbness, numbness in finger, hand numbness, numbness in foot, numbness of finger, low extremity numbness, numbness in toe, numbness in leg, numbness of hand, numbness of toe
tingle	112	tingle, have tingle sensation, tingle in finger, tingle finger
cramp	74	cramp, crampy, cramp sensation quality, muscle cramp, acute pain, leg cramp
diarrhea	26	diarrhea, vomit diarrhoea, diarrhea and vomiting, severe diarrhea
nausea	1121	nausea, nausea vomiting, chronic nausea, nausea and vomiting, postoperative nausea, symptom nausea

Based on the extracted symptoms and the frequency of their co-occurrences within the clinical notes, we constructed the symptom graph G, shown in Figure 7. The weights on the edges represent the frequency of the co-occurrences in the clinical notes. In this research, we set the threshold θ introduced in Section 3.3 to 5, which means we only included the edges with a weight of 5 or more. The symptom graph shows that some symptoms often co-occur. For example, ‘nausea’ and ‘vomit’ co-occur 680 times, ‘nausea’ and ‘depression’ co-occur 109 times, etc. This symptom graph provides a direct visualization of the co-occurrences of the symptoms.

Figure 7: Symptom Graph Built Using Our Data Set

4.3. Symptom Graph Clustering Results

To generate the symptom clusters, we first applied the weighted Node2Vec algorithm described in Section 3.4 to the graph to generate symptom embeddings. The hyperparameters of the weighted Node2Vec are optimized (shown in Table 3). In this research, we set the size of the embeddings to be 512.

Table 3: Hyperparameters of Weighted Node2Vec

Parameter	Value
Maximum length of a random walk	100
Number of random walks per root node	10
Probability of returning to source node (1/p)	2
Probability for moving away from source node (1/q)	0.5
Random seed	42
Size of embedding	512
Size of window	5
Number of epochs	11000

After the symptom embeddings were generated, we applied t-Distributed Stochastic Neighbor Embedding (t-SNE) [48], an unsupervised, non-linear technique for visualizing high-dimensional data, to visualize the symptom distribution. Figure 8 shows the t-SNE projection in two-dimensional space. The visualization shows that some symptoms are closer to each other than others. For example, ‘tobacco abuse’ and ‘alcohol abuse’ are located closer to each other than to the other symptoms, whereas ‘memory loss’ is relatively far from the rest of the symptoms.

Figure 8: t-SNE Visualization of Symptom Distribution

We calculated the Dunn Index value of each clustering algorithm with different k values. As shown in Figure 9, HC achieves a higher DI than the other methods. To investigate further, we applied different HC methods (Figure 10) based on the distance calculation and found that ‘ward’ performs better than the ‘single’, ‘average’, and ‘weighted’ distance calculations. Based on these results, we determined that the optimum number of symptom clusters is 9.

Figure 11 shows the clustered symptom graph, in which symptoms within the same cluster share the same color. The t-SNE visualization also reflects these cluster distributions. There are four clusters with only one symptom each: ‘memory loss’, ‘shake’, ‘diarrhea’, and ‘poor appetite’. In the t-SNE visualization, these single-symptom clusters are not close to each other and are far from the other symptoms, which suggests that these symptoms are less correlated with other symptoms in our data set. These single symptoms reflect either individual gastrointestinal problems or mental issues. There are two large clusters with four symptoms each. One of them contains ‘anxious’, ‘insomnia’, ‘weakness’, and ‘depression’, which are mainly mood problems that arise after cancer patients start chemotherapy [37, 44, 45]. The other large cluster includes ‘nausea’, ‘fatigue’, ‘vomit’, and ‘constipation’, which are typical gastrointestinal distress and physical function symptoms after colorectal cancer patients receive chemotherapy [1, 8]. Three clusters include two or three symptoms. ‘Tobacco abuse’ and ‘alcohol abuse’ are in the same cluster and have been found to be associated with other symptom expressions, such as significantly higher pain [13]. The same research also showed that alcohol abuse is highly prevalent in patients with advanced cancer, and that a subgroup of cancer patients are more likely to have a history of chemotherapy or to be actively smoking [13]. ‘Tingling’ and ‘numbness’ are two symptoms linked to the neuropathic side effects commonly reported in patients receiving chemotherapy [42]. According to the literature [46], tingling or numbness in the fingers and hands and/or toes and feet are frequently reported peripheral neuropathy symptoms. ‘Abdominal discomfort’, ‘abdominal bloating’, and ‘cramp’ are grouped in one cluster. These are gastrointestinal symptoms often seen in colorectal cancer patients [7].

4.4. Statistical Analysis and Evaluation of the Symptom Clusters

To further evaluate the symptom clusters discovered in our data set, we applied statistical analysis to the symptoms in each cluster to identify the associated clinical parameters. Table 4 shows the number of patients who have at least one symptom in each symptom cluster, along with the mean and standard deviation of age, the gender distribution, the percentage of patients with diabetes, and the percentage of patients in each cancer stage. The colored circles in Table 4 match the symptom nodes in Figure 11. The t-test was used to assess age differences between the patients of any two symptom clusters, and Pearson’s chi-squared test was used to determine whether there was a statistically significant difference between two clusters regarding gender, diabetes status, and cancer stage. Patients who belong to both of the compared clusters were excluded when calculating the t-test and Pearson’s chi-squared test, so there is no overlap between the two compared clusters.
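
The overlap exclusion and the chi-squared comparison can be sketched as follows for a binary attribute such as diabetes status. This is an illustration under stated assumptions, not the code used in this study: the function names and patient IDs are hypothetical, no continuity correction is applied, and the p-value formula holds only for a 2×2 table (one degree of freedom). A comparison across several cancer stages would need a larger table and a general chi-squared distribution.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom the
    upper-tail probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, erfc(sqrt(stat / 2))


def compare_binary_attribute(cluster_x, cluster_y, positive):
    """Compare two clusters on a binary attribute after removing shared patients.

    cluster_x and cluster_y are sets of patient IDs; positive is the set
    of patient IDs that have the attribute (for example, diabetes).
    """
    shared = cluster_x & cluster_y
    x_only, y_only = cluster_x - shared, cluster_y - shared
    a = len(x_only & positive)
    b = len(x_only) - a
    c = len(y_only & positive)
    d = len(y_only) - c
    return chi2_2x2(a, b, c, d)


# Toy example (hypothetical IDs): patients 4 and 5 are in both clusters.
stat, p_value = compare_binary_attribute(
    cluster_x={1, 2, 3, 4, 5},
    cluster_y={4, 5, 6, 7, 8},
    positive={1, 2, 4, 6},
)
```

Removing the shared patients first keeps the two samples independent, which the chi-squared test assumes.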

Table 4: Statistics of the Patients with Symptoms of Each Cluster

	C0 (tobacco abuse, alcohol abuse)	C1 (poor appetite)	C2 (shake)	C3 (depression, weakness, anxious, insomnia)	C4 (vomit, nausea, fatigue, constipation)	C5 (numbness, tingle)	C6 (cramp, abdominal discomfort, abdominal bloating)	C7 (diarrhea)	C8 (memory loss)
# of patients	58	28	30	414	917	156	148	26	14
Age, mean (std)	57.48 (9.09)	63.57 (14.28)	54.5 (15.48)	60.4 (12.44)	59.19 (12.57)	58.03 (10.97)	54.55 (12.58)	60.27 (14.84)	61.07 (16.75)
Gender
Female (%)	60.34	50	66.67	47.34	51.36	53.21	51.35	57.69	42.86
Male (%)	39.66	50	33.33	52.66	48.64	46.79	48.65	42.31	57.14
Diabetes (%)	25.86	17.86	30	34.3	27.92	34.62	22.3	38.46	50
Cancer Stage
Unknown (%)	53.45	35.71	40	34.06	29.88	19.87	31.76	26.92	14.29
0 (%)	0.0	0.0	0.0	0.72	1.64	1.28	2.03	0.0	0.0
II (%)	15.52	21.43	20	17.15	17.23	14.1	19.59	7.69	0.0
III (%)	31.03	42.86	40	48.07	51.25	64.74	46.62	65.38	85.71

The mean age of patients in cluster C0 is 57.48. The t-test analysis shows that the patients who have symptoms of ‘tobacco abuse’ or ‘alcohol abuse’ are statistically significantly younger than those having symptoms in either C1 (p=0.04) or C3 (p=0.03). That is, patients with either ‘tobacco abuse’ or ‘alcohol abuse’ are younger than patients who have ‘depression’ or ‘poor appetite’. The literature also shows that patients with advanced cancer who have alcoholism, tobacco use, or illegal drug use are relatively younger [13], which is consistent with our findings. The t-test results also show that patients in cluster C1 (mean age 63.57) are statistically significantly older than those in C6 (p<0.01). Research indicates that loss of appetite is especially serious in elderly patients who are critically ill, such as cancer patients [38], and that taste and smell changes occur with advancing age, which can lead to poor appetite [12].

Table 4 shows that relatively more patients with diabetes experience symptoms in C5, C7, and C8. The chi-squared test results show that patients with diabetes more often experienced ‘numbness’ or ‘tingling’ in the fingers, hands, or lower extremities (C5) than symptoms in C1 (p<0.02), C4 (p<0.03), and C6 (p<0.01). Although our study data set includes 996 CRC patients, our results extend similar findings from recent research [39, 49], specifically that CRC patients with diabetes experience mild to severe neuropathic symptoms, such as tingling or numbness in the fingers or hands.

From the cancer stage perspective, the percentage of patients in cancer stage III is high in clusters C5, C7, and C8. The statistical analysis shows that the proportion of patients in a later cancer stage in C5 is significantly higher than in C0 (p<0.01), C1 (p<0.01), C2 (p<0.01), C3 (p<0.01), C4 (p<0.01), and C6 (p<0.01), which suggests that patients in later cancer stages are more likely to experience neuropathic symptoms. A recent study on Korean cancer patients identified stage IV cancer and a history of chemotherapy as predictors of neuropathic cancer pain [3], which is similar to our finding. However, our data set only included cancers up to stage III. The patients who experienced ‘memory loss’ were mostly in stage III. The statistical analysis shows that more stage III patients experienced ‘memory loss’ than symptoms in C0 (p<0.01), C1 (p<0.01), C2 (p<0.01), C3 (p<0.01), C4 (p<0.01), and C6 (p<0.01) after chemotherapy. Although we have not identified literature that supports this finding, it could be a valuable research hypothesis for further clinical exploration.
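
For the age comparisons discussed above, a two-sample t statistic can be computed directly from summary statistics. The sketch below uses Welch's unequal-variance form, which is an assumption because the variant of the t-test used in this study is not specified. It is applied to the C1 and C6 values from Table 4 for illustration only. Those counts include patients who appear in both clusters, whereas the reported tests excluded them, so the sketch is not expected to reproduce the reported p-values. The function name is hypothetical.

```python
from math import sqrt


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and approximate degrees of freedom.

    Assumes std1 and std2 are sample standard deviations.
    """
    v1 = std1 ** 2 / n1
    v2 = std2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Age summary for C1 (poor appetite) and C6 (abdominal symptoms) from Table 4.
t_stat, df = welch_t(63.57, 14.28, 28, 54.55, 12.58, 148)
```

A positive statistic means that the first group is older on average. The p-value would then be read from a t distribution with the computed degrees of freedom.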

5. DISCUSSION

This research developed an NLP+AI system that first identifies symptom expressions in EHR clinical notes. We then developed a graph-based clustering approach to identify symptom clusters from the symptom co-occurrences in each clinical note. The system is relatively easy to implement, and the graph visualization can help clinicians identify symptom co-occurrences in clinical data. Compared to other research on symptom cluster generation, our study demonstrates how to generate symptom graphs from the narrative text of clinical notes in the EHR to identify symptom clusters. Although this research analyzed and evaluated SymptomGraph using a data set of colorectal cancer patients, the system can be applied to other patient cohorts to discover the symptom clusters that patients experience before or after treatments. The symptom cluster findings can assist clinicians in discovering hidden symptom clusters and in developing and validating clinical hypotheses. Based on the symptom clusters and their associations with clinical parameters, clinicians can investigate personalized treatments or symptom management strategies. The statistical analysis of the symptom clusters suggests that clinical parameters, such as age, cancer stage, or comorbidities like diabetes, can be used as predictors of symptom clusters after a treatment like chemotherapy. Our statistical analysis results for the colorectal cancer patients show that SymptomGraph can reproduce findings from previous clinical research in the literature as well as identify new symptom clusters that need further clinical validation and investigation.

There are a few limitations to this research. Although word sense disambiguation (WSD) is enabled in UMLS MetaMap, its WSD accuracy is limited, so some symptom expressions in the clinical notes may have been missed. Also, signs and symptoms written with typos cannot be identified. Uncommon signs and symptoms that appear in fewer than five clinical notes were not included, because this research focuses on the overall performance of SymptomGraph in discovering common symptom clusters.

6. CONCLUSIONS AND FUTURE WORK

The EHR clinical report is the main instrument that clinicians use to document patient symptoms before and after treatment. Extracting and analyzing the documented symptoms for chronic or acute diseases is essential for improving patient care and developing personalized clinical decision systems. Our research team developed a system for symptom cluster identification using both NLP and AI techniques. The system includes NLP components for symptom expression extraction and semantic grouping, and AI components that identify symptom clusters based on graph learning and clustering. The patients who have the symptoms in a cluster can be further analyzed with other clinical parameters, such as ethnicity, comorbidities, and medication. Future work should include building a larger graph that includes symptoms, genotypes, and phenotypes to discover their correlations.

Figure 10: — Dunn Index of Different HC Methods

CCS CONCEPTS.

• Applied computing → Health informatics; • Computing methodologies → Artificial intelligence;

ACKNOWLEDGEMENT

This research was partially supported by the National Institute of General Medical Sciences grant 1R15GM139094.

Contributor Information

Fattah Muhammad Tahabi, Department of ECE, IUPUI.

Susan Storey, School of Nursing, Indiana University.

Xiao Luo, Department of CIT, IUPUI.

REFERENCES

[1] Andreyev HJN, Norman AR, Oates J, and Cunningham D. 1998. Why do patients with weight loss have a worse outcome when undergoing chemotherapy for gastrointestinal malignancies? European journal of cancer 34, 4 (1998), 503–509.
[2] Aronson Alan R. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium. American Medical Informatics Association, 17.
[3] Baek Sun Kyung, Shin Sang Won, Koh Su-Jin, Kim Jung Han, Kim Hyo Jung, Shim Byoung Yong, Kang Seok Yun, Bae Sang Byung, Yun Hwan Jung, Sym Sun Jin, et al. 2021. Significance of descriptive symptoms and signs and clinical parameters as predictors of neuropathic cancer pain. PloS one 16, 8 (2021), e0252781.
[4] Barsevick Andrea M. 2007. The elusive concept of the symptom cluster. In Oncology nursing forum, Vol. 34.
[5] Bytzer Peter, Talley Nicholas J, Leemon Melanie, Young Lisa J, Jones Michael P, and Horowitz Michael. 2001. Prevalence of gastrointestinal symptoms associated with diabetes mellitus: a population-based survey of 15 000 adults. Archives of internal medicine 161, 16 (2001), 1989–1996.
[6] Campello Ricardo JGB, Moulavi Davoud, and Sander Jörg. 2013. Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining. Springer, 160–172.
[7] Cappell Mitchell S. 2008. Pathophysiology, clinical presentation, and management of colon cancer. Gastroenterology Clinics of North America 37, 1 (2008), 1–24.
[8] Carlotto Alan, Hogsett Virginia L, Maiorini Elyse M, Razulis Janet G, and Sonis Stephen T. 2013. The economic burden of toxicities associated with cancer treatment: review of the literature and analysis of nausea and vomiting, diarrhoea, oral mucositis and fatigue. Pharmacoeconomics 31, 9 (2013), 753–766.
[9] Cer Daniel, Yang Yinfei, Kong Sheng-yi, Hua Nan, Limtiaco Nicole, Rhomni St John, Noah Constant, Mario Guajardo-Céspedes, Yuan Steve, Tar Chris, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018).
[10] Chen Zhan-Heng, You Zhu-Hong, Guo Zhen-Hao, Yi Hai-Cheng, Luo Gong-Xu, and Wang Yan-Bin. 2020. Predicting Drug-Target Interactions by Node2vec Node Embedding in Molecular Associations Network. In International Conference on Intelligent Computing. Springer, 348–358.
[11] Chow Selina, Wan Bo Angela, Pidduck William, Zhang Liying, DeAngelis Carlo, Chan Stephanie, Yee Caitlin, Drost Leah, Leung Eric, Sousa Philomena, et al. 2019. Symptom clusters in patients with breast cancer receiving radiation therapy. European Journal of Oncology Nursing 42 (2019), 14–20.
[12] de Jong Nynke, Mulder Ina, de Graaf Cees, and van Staveren Wija A. 1999. Impaired sensory functioning in elders: the relation with its potential determinants and nutritional intake. Journals of Gerontology Series A: Biomedical Sciences and Medical Sciences 54, 8 (1999), B324–B331.
[13] Dev Rony, Parsons Henrique A, Palla Shana, Palmer J Lynn, Del Fabbro Egidio, and Bruera Eduardo. 2011. Undocumented alcoholism and its correlation with tobacco and illegal drug use in advanced cancer patients. Cancer 117, 19 (2011), 4551–4556.
[14] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[15] Dong Skye Tian, Butow Phyllis N, Costa Daniel SJ, Lovell Melanie R, and Agar Meera. 2014. Symptom clusters in patients with advanced cancer: a systematic review of observational studies. Journal of pain and symptom management 48, 3 (2014), 411–450.
[16] Du Jingcheng, Wang Qing, Wang Jingqi, Ramesh Prerana, Xiang Yang, Jiang Xiaoqian, and Tao Cui. 2021. COVID-19 Trial Graph: A Linked Graph for COVID-19 Clinical Trials. Journal of the American Medical Informatics Association (2021).
[17] Ester Martin, Kriegel Hans-Peter, Sander Jörg, Xu Xiaowei, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, Vol. 96. 226–231.
[18] Girvan Michelle and Newman Mark EJ. 2002. Community structure in social and biological networks. Proceedings of the national academy of sciences 99, 12 (2002), 7821–7826.
[19] Golmaei Sara Nouri and Luo Xiao. 2021. DeepNote-GNN: predicting hospital readmission using clinical notes and patient network. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 1–9.
[20] Grover Aditya and Leskovec Jure. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
[21] Gu Kang, Vosoughi Soroush, and Prioleau Temiloluwa. 2021. SymptomID: a framework for rapid symptom identification in pandemics using news reports. ACM Transactions on Management Information Systems (TMIS) 12, 4 (2021), 1–17.
[22] Kang Chuanze, Zhang Han, Liu Zhuo, Huang Shenwei, and Yin Yanbin. 2021. LR-GNN: a graph neural network based on link representation for predicting molecular associations. Briefings in Bioinformatics (2021).
[23] Kim Hee-Ju. 2008. Common factor analysis versus principal component analysis: choice for symptom cluster research. Asian nursing research 2, 1 (2008), 17–24.
[24] Kim Hee-Ju and Abraham Ivo L. 2008. Statistical approaches to modeling symptom clusters in cancer patients. Cancer nursing 31, 5 (2008), E1–E10.
[25] Kim Hee-Ju, McGuire Deborah B, Tulman Lorraine, and Barsevick Andrea M. 2005. Symptom clusters: concept analysis and clinical implications for cancer nursing. Cancer nursing 28, 4 (2005), 270–282.
[26] Kim Munui, Baek Seung Han, and Song Min. 2018. Relation extraction for biological pathway construction using node2vec. BMC bioinformatics 19, 8 (2018), 75–84.
[27] Koleck Theresa A, Dreisbach Caitlin, Bourne Philip E, and Bakken Suzanne. 2019. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. Journal of the American Medical Informatics Association 26, 4 (2019), 364–379.
[28] Lee Junghwan, Liu Cong, Jae Hyun Kim Alex Butler, Shang Ning, Pang Chao, Natarajan Karthik, Ryan Patrick, Ta Casey, and Weng Chunhua. 2021. Comparative effectiveness of medical concept embedding for feature engineering in phenotyping. JAMIA open 4, 2 (2021), ooab028.
[29] Linder Lauri A, Ameringer Suzanne, Baggott Christina, Erickson Jeanne, Macpherson Catherine Fiona, Rodgers Cheryl, and Stegenga Kristin. 2015. Measures and methods for symptom and symptom cluster assessment in adolescents and young adults with cancer. In Seminars in oncology nursing, Vol. 31. Elsevier, 206–215.
[30] Luo Xiao, Gandhi Priyanka, Storey Susan, Zhang Zuoyi, Han Zhi, and Huang Kun. 2021. A computational framework to analyze the associations between symptoms and cancer patient attributes post chemotherapy using EHR data. IEEE Journal of Biomedical and Health Informatics 25, 11 (2021), 4098–4109.
[31] McFarland Daniel C, Shaffer Kelly M, Tiersten Amy, and Holland Jimmie. 2018. Physical symptom burden and its association with distress, anxiety, and depression in breast cancer. Psychosomatics 59, 5 (2018), 464–471.
[32] Neijenhuijs Koen I, Peeters Carel FW, van Weert Henk, Cuijpers Pim, and Verdonck-de Leeuw Irma. 2021. Symptom clusters among cancer survivors: what can machine learning techniques tell us? BMC Medical Research Methodology 21, 1 (2021), 1–12.
[33] Ni Jingchao, Fei Hongliang, Fan Wei, and Zhang Xiang. 2017. Automated medical diagnosis by ranking clusters across the symptom-disease network. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 1009–1014.
[34] Oniani David, Jiang Guoqian, Liu Hongfang, and Shen Feichen. 2020. Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases. Journal of the American Medical Informatics Association 27, 8 (2020), 1259–1267.
[35] Papachristou Nikolaos, Miaskowski Christine, Barnaghi Payam, Maguire Roma, Farajidavar Nazli, Cooper Bruce, and Hu Xiao. 2016. Comparing machine learning clustering with latent class analysis on cancer symptoms’ data. In 2016 IEEE Healthcare Innovation Point-Of-Care Technologies Conference (HI-POCT). IEEE, 162–166.
[36] Radford Alec, Wu Jeffrey, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[37] Redeker Nancy S, Lev Elise L, and Ruggiero Jeanne. 2000. Insomnia, fatigue, anxiety, depression, and quality of life of cancer patients undergoing chemotherapy. Scholarly inquiry for nursing practice 14, 4 (2000), 275–290.
[38] Schiffman SS and Graham BG. 2000. Taste and smell perception affect appetite and immunity in the elderly. European journal of clinical nutrition 54, 3 (2000), S54–S63.
[39] Sempere-Bigorra Mar, Julián-Rochina Iván, and Cauli Omar. 2021. Chemotherapy-induced neuropathy and diabetes: a scoping review. Current Oncology 28, 4 (2021), 3124–3138.
[40] Shen Feichen, Liu Sijia, Wang Yanshan, Wang Liwei, Wen Andrew, Limper Andrew H, and Liu Hongfang. 2018. Constructing node embeddings for human phenotype ontology to assist phenotypic similarity measurement. In 2018 IEEE International Conference on Healthcare Informatics Workshop (ICHI-W). IEEE, 29–33.
[41] Sondhi Parikshit, Sun Jimeng, Tong Hanghang, and Zhai ChengXiang. 2012. SympGraph: a framework for mining clinical notes through symptom relation graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 1167–1175.
[42] Staff Nathan P, Grisold Anna, Grisold Wolfgang, and Windebank Anthony J. 2017. Chemotherapy-induced peripheral neuropathy: a current review. Annals of neurology 81, 6 (2017), 772–781.
[43] Stavem Knut, Ghanima Waleed, Olsen Magnus Kringstad, Gilboe Hanne Margrethe, and Einvik Gunnar. 2021. Persistent symptoms 1.5–6 months after COVID-19 in non-hospitalised subjects: a population-based cohort study. Thorax 76, 4 (2021), 405–407.
[44] Sun Guang-Wei, Yang Yi-Long, Yang Xue-Bin, Wang Yin-Yin, Cui Xue-Jiao, Liu Ying, and Xing Cheng-Zhong. 2020. Preoperative insomnia and its association with psychological factors, pain and anxiety in Chinese colorectal cancer patients. Supportive Care in Cancer 28, 6 (2020), 2911–2919.
[45] Theobald Dale E. 2004. Cancer pain, fatigue, distress, and insomnia in cancer patients. Clinical cornerstone 6, 1 (2004), S15–S21.
[46] Tofthagen Cindy, McAllister R Denise, and McMillan Susan C. 2011. Peripheral neuropathy in patients with colorectal cancer receiving oxaliplatin. Clinical journal of oncology nursing 15, 2 (2011).
[47] Tsai Jaw-Shiun, Wu Chih-Hsun, Chiu Tai-Yuan, and Chen Ching-Yu. 2010. Significance of symptom clustering in palliative care of advanced cancer patients. Journal of pain and symptom management 39, 4 (2010), 655–662.
[48] Van der Maaten Laurens and Hinton Geoffrey. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
[49] Vissers Pauline AJ, Mols Floortje, Thong Melissa SY, Pouwer Frans, Vreugdenhil Gerard, and van de Poll-Franse Lonneke V. 2015. The impact of diabetes on neuropathic symptoms and receipt of chemotherapy among colorectal cancer patients: results from the PROFILES registry. Journal of Cancer Survivorship 9, 3 (2015), 523–531.
[50] Walsh Declan and Rybicki Lisa. 2006. Symptom clustering in advanced cancer. Supportive care in cancer 14, 8 (2006), 831–836.
[51] Yang Zhilin, Dai Zihang, Yang Yiming, Carbonell Jaime, Salakhutdinov Russ R, and Le Quoc V. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).
[52] Zhou Jie, Cui Ganqu, Hu Shengding, Zhang Zhengyan, Yang Cheng, Liu Zhiyuan, Wang Lifeng, Li Changcheng, and Sun Maosong. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.

[R1] [1].Andreyev HJN, Norman AR, Oates J, and Cunningham D. 1998. Why do patients with weight loss have a worse outcome when undergoing chemotherapy for gastrointestinal malignancies? European journal of cancer 34, 4 (1998), 503–509. [DOI] [PubMed] [Google Scholar]

[R2] [2].Aronson Alan R. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.. In Proceedings of the AMIA Symposium. American Medical Informatics Association, 17. [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Baek Sun Kyung, Shin Sang Won, Koh Su-Jin, Kim Jung Han, Kim Hyo Jung, Shim Byoung Yong, Kang Seok Yun, Bae Sang Byung, Yun Hwan Jung, Sym Sun Jin, et al. 2021. Significance of descriptive symptoms and signs and clinical parameters as predictors of neuropathic cancer pain. PloS one 16, 8 (2021), e0252781. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] [4].Barsevick Andrea M. 2007. The elusive concept of the symptom cluster.. In Oncology nursing forum, Vol. 34. [DOI] [PubMed] [Google Scholar]

[R5] [5].Bytzer Peter, Talley Nicholas J, Leemon Melanie, Young Lisa J, Jones Michael P, and Horowitz Michael. 2001. Prevalence of gastrointestinal symptoms associated with diabetes mellitus: a population-based survey of 15 000 adults. Archives of internal medicine 161, 16 (2001), 1989–1996. [DOI] [PubMed] [Google Scholar]

[R6] [6].Campello Ricardo JGB, Moulavi Davoud, and Sander Jörg. 2013. Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining. Springer, 160–172. [Google Scholar]

[R7] [7].Cappell Mitchell S. 2008. Pathophysiology, clinical presentation, and management of colon cancer. Gastroenterology Clinics of North America 37, 1 (2008), 1–24. [DOI] [PubMed] [Google Scholar]

[R8] [8].Carlotto Alan, Hogsett Virginia L, Maiorini Elyse M, Razulis Janet G, and Sonis Stephen T. 2013. The economic burden of toxicities associated with cancer treatment: review of the literature and analysis of nausea and vomiting, diarrhoea, oral mucositis and fatigue. Pharmacoeconomics 31, 9 (2013), 753–766. [DOI] [PubMed] [Google Scholar]

[R9] [9].Cer Daniel, Yang Yinfei, Kong Sheng-yi, Hua Nan, Limtiaco Nicole, Rhomni St John, Noah Constant, Mario Guajardo-Céspedes, Yuan Steve, Tar Chris, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018). [Google Scholar]

[R10] [10].Chen Zhan-Heng, You Zhu-Hong, Guo Zhen-Hao, Yi Hai-Cheng, Luo Gong-Xu, and Wang Yan-Bin. 2020. Predicting Drug-Target Interactions by Node2vec Node Embedding in Molecular Associations Network. In International Conference on Intelligent Computing. Springer, 348–358. [Google Scholar]

[R11] [11].Chow Selina, Wan Bo Angela, Pidduck William, Zhang Liying, DeAngelis Carlo, Chan Stephanie, Yee Caitlin, Drost Leah, Leung Eric, Sousa Philomena, et al. 2019. Symptom clusters in patients with breast cancer receiving radiation therapy. European Journal of Oncology Nursing 42 (2019), 14–20. [DOI] [PubMed] [Google Scholar]

[R12] [12].de Jong Nynke, Mulder Ina, de Graaf Cees, and van Staveren Wija A. 1999. Impaired sensory functioning in elders: the relation with its potential determinants and nutritional intake. Journals of Gerontology Series A: Biomedical Sciences and Medical Sciences 54, 8 (1999), B324–B331. [DOI] [PubMed] [Google Scholar]

[R13] [13].Dev Rony, Parsons Henrique A, Palla Shana, Palmer J Lynn, Del Fabbro Egidio, and Bruera Eduardo. 2011. Undocumented alcoholism and its correlation with tobacco and illegal drug use in advanced cancer patients. Cancer 117, 19 (2011), 4551–4556. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] [14].Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). [Google Scholar]

[R15] [15].Dong Skye Tian, Butow Phyllis N, Costa Daniel SJ, Lovell Melanie R, and Agar Meera. 2014. Symptom clusters in patients with advanced cancer: a systematic review of observational studies. Journal of pain and symptom management 48, 3 (2014), 411–450. [DOI] [PubMed] [Google Scholar]

[R16] [16].Du Jingcheng, Wang Qing, Wang Jingqi, Ramesh Prerana, Xiang Yang, Jiang Xiaoqian, and Tao Cui. 2021. COVID-19 Trial Graph: A Linked Graph for COVID-19 Clinical Trials. Journal of the American Medical Informatics Association (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] [17].Ester Martin, Kriegel Hans-Peter, Sander Jörg, Xu Xiaowei, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In kdd, Vol. 96. 226–231. [Google Scholar]

[R18] [18].Girvan Michelle and Newman Mark EJ. 2002. Community structure in social and biological networks. Proceedings of the national academy of sciences 99, 12 (2002), 7821–7826. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] [19].Golmaei Sara Nouri and Luo Xiao. 2021. DeepNote-GNN: predicting hospital readmission using clinical notes and patient network. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 1–9. [Google Scholar]

[R20] [20].Grover Aditya and Leskovec Jure. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] [21].Gu Kang, Vosoughi Soroush, and Prioleau Temiloluwa. 2021. SymptomID: a framework for rapid symptom identification in pandemics using news reports. ACM Transactions on Management Information Systems (TMIS) 12, 4 (2021), 1–17. [Google Scholar]

[R22] [22].Kang Chuanze, Zhang Han, Liu Zhuo, Huang Shenwei, and Yin Yanbin. 2021. LR-GNN: a graph neural network based on link representation for predicting molecular associations. Briefings in Bioinformatics (2021). [DOI] [PubMed] [Google Scholar]

[R23] [23].Kim Hee-Ju. 2008. Common factor analysis versus principal component analysis: choice for symptom cluster research. Asian nursing research 2, 1 (2008), 17–24. [DOI] [PubMed] [Google Scholar]

[R24] [24].Kim Hee-Ju and Abraham Ivo L. 2008. Statistical approaches to modeling symptom clusters in cancer patients. Cancer nursing 31, 5 (2008), E1–E10. [DOI] [PubMed] [Google Scholar]

[R25] [25].Kim Hee-Ju, McGuire Deborah B, Tulman Lorraine, and Barsevick Andrea M. 2005. Symptom clusters: concept analysis and clinical implications for cancer nursing. Cancer nursing 28, 4 (2005), 270–282. [DOI] [PubMed] [Google Scholar]

[R26] [26].Kim Munui, Baek Seung Han, and Song Min. 2018. Relation extraction for biological pathway construction using node2vec. BMC bioinformatics 19, 8 (2018), 75–84. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] [27].Koleck Theresa A, Dreisbach Caitlin, Bourne Philip E, and Bakken Suzanne. 2019. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. Journal of the American Medical Informatics Association 26, 4 (2019), 364–379. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] [28].Lee Junghwan, Liu Cong, Jae Hyun Kim Alex Butler, Shang Ning, Pang Chao, Natarajan Karthik, Ryan Patrick, Ta Casey, and Weng Chunhua. 2021. Comparative effectiveness of medical concept embedding for feature engineering in phenotyping. JAMIA open 4, 2 (2021), ooab028. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] [29].Linder Lauri A, Ameringer Suzanne, Baggott Christina, Erickson Jeanne, Macpherson Catherine Fiona, Rodgers Cheryl, and Stegenga Kristin. 2015. Measures and methods for symptom and symptom cluster assessment in adolescents and young adults with cancer. In Seminars in oncology nursing, Vol. 31. Elsevier, 206–215. [DOI] [PubMed] [Google Scholar]

[R30] [30].Luo Xiao, Gandhi Priyanka, Storey Susan, Zhang Zuoyi, Han Zhi, and Huang Kun. 2021. A computational framework to analyze the associations between symptoms and cancer patient attributes post chemotherapy using EHR data. IEEE Journal of Biomedical and Health Informatics 25, 11 (2021), 4098–4109. [DOI] [PubMed] [Google Scholar]

[R31] [31].McFarland Daniel C, Shaffer Kelly M, Tiersten Amy, and Holland Jimmie. 2018. Physical symptom burden and its association with distress, anxiety, and depression in breast cancer. Psychosomatics 59, 5 (2018), 464–471. [DOI] [PMC free article] [PubMed] [Google Scholar]

[32] Neijenhuijs Koen I, Peeters Carel FW, van Weert Henk, Cuijpers Pim, and Verdonck-de Leeuw Irma. 2021. Symptom clusters among cancer survivors: what can machine learning techniques tell us? BMC Medical Research Methodology 21, 1 (2021), 1–12.

[33] Ni Jingchao, Fei Hongliang, Fan Wei, and Zhang Xiang. 2017. Automated medical diagnosis by ranking clusters across the symptom-disease network. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 1009–1014.

[34] Oniani David, Jiang Guoqian, Liu Hongfang, and Shen Feichen. 2020. Constructing co-occurrence network embeddings to assist association extraction for COVID-19 and other coronavirus infectious diseases. Journal of the American Medical Informatics Association 27, 8 (2020), 1259–1267.

[35] Papachristou Nikolaos, Miaskowski Christine, Barnaghi Payam, Maguire Roma, Farajidavar Nazli, Cooper Bruce, and Hu Xiao. 2016. Comparing machine learning clustering with latent class analysis on cancer symptoms’ data. In 2016 IEEE Healthcare Innovation Point-Of-Care Technologies Conference (HI-POCT). IEEE, 162–166.

[36] Radford Alec, Wu Jeffrey, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.

[37] Redeker Nancy S, Lev Elise L, and Ruggiero Jeanne. 2000. Insomnia, fatigue, anxiety, depression, and quality of life of cancer patients undergoing chemotherapy. Scholarly inquiry for nursing practice 14, 4 (2000), 275–290.

[38] Schiffman SS and Graham BG. 2000. Taste and smell perception affect appetite and immunity in the elderly. European journal of clinical nutrition 54, 3 (2000), S54–S63.

[39] Sempere-Bigorra Mar, Julián-Rochina Iván, and Cauli Omar. 2021. Chemotherapy-induced neuropathy and diabetes: a scoping review. Current Oncology 28, 4 (2021), 3124–3138.

[40] Shen Feichen, Liu Sijia, Wang Yanshan, Wang Liwei, Wen Andrew, Limper Andrew H, and Liu Hongfang. 2018. Constructing node embeddings for human phenotype ontology to assist phenotypic similarity measurement. In 2018 IEEE International Conference on Healthcare Informatics Workshop (ICHI-W). IEEE, 29–33.

[41] Sondhi Parikshit, Sun Jimeng, Tong Hanghang, and Zhai ChengXiang. 2012. SympGraph: a framework for mining clinical notes through symptom relation graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 1167–1175.

[42] Staff Nathan P, Grisold Anna, Grisold Wolfgang, and Windebank Anthony J. 2017. Chemotherapy-induced peripheral neuropathy: a current review. Annals of neurology 81, 6 (2017), 772–781.

[43] Stavem Knut, Ghanima Waleed, Olsen Magnus Kringstad, Gilboe Hanne Margrethe, and Einvik Gunnar. 2021. Persistent symptoms 1.5–6 months after COVID-19 in non-hospitalised subjects: a population-based cohort study. Thorax 76, 4 (2021), 405–407.

[44] Sun Guang-Wei, Yang Yi-Long, Yang Xue-Bin, Wang Yin-Yin, Cui Xue-Jiao, Liu Ying, and Xing Cheng-Zhong. 2020. Preoperative insomnia and its association with psychological factors, pain and anxiety in Chinese colorectal cancer patients. Supportive Care in Cancer 28, 6 (2020), 2911–2919.

[45] Theobald Dale E. 2004. Cancer pain, fatigue, distress, and insomnia in cancer patients. Clinical cornerstone 6, 1 (2004), S15–S21.

[46] Tofthagen Cindy, McAllister R Denise, and McMillan Susan C. 2011. Peripheral neuropathy in patients with colorectal cancer receiving oxaliplatin. Clinical journal of oncology nursing 15, 2 (2011).

[47] Tsai Jaw-Shiun, Wu Chih-Hsun, Chiu Tai-Yuan, and Chen Ching-Yu. 2010. Significance of symptom clustering in palliative care of advanced cancer patients. Journal of pain and symptom management 39, 4 (2010), 655–662.

[48] Van der Maaten Laurens and Hinton Geoffrey. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).

[49] Vissers Pauline AJ, Mols Floortje, Thong Melissa SY, Pouwer Frans, Vreugdenhil Gerard, and van de Poll-Franse Lonneke V. 2015. The impact of diabetes on neuropathic symptoms and receipt of chemotherapy among colorectal cancer patients: results from the PROFILES registry. Journal of Cancer Survivorship 9, 3 (2015), 523–531.

[50] Walsh Declan and Rybicki Lisa. 2006. Symptom clustering in advanced cancer. Supportive care in cancer 14, 8 (2006), 831–836.

[51] Yang Zhilin, Dai Zihang, Yang Yiming, Carbonell Jaime, Salakhutdinov Russ R, and Le Quoc V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).

[52] Zhou Jie, Cui Ganqu, Hu Shengding, Zhang Zhengyan, Yang Cheng, Liu Zhiyuan, Wang Lifeng, Li Changcheng, and Sun Maosong. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
