Abstract
Significance.
Analyzing narratives in patients’ medical records using a framework that combines Natural Language Processing (NLP) and machine learning may help uncover underlying patterns in patients’ visual capabilities and the challenges they face, and could be useful in analyzing Big Data in optometric research.
Purpose.
The primary goal of this study was to demonstrate the feasibility of applying a framework that combines NLP and machine learning to analyze narratives in patients’ medical records. To test and validate our framework, we applied it to analyze records of low vision patients and to address two questions: Was there an association between patients’ narratives related to activities of daily living and the quality of their vision? Was there an association between patients’ narratives related to activities of daily living and their attitudes toward certain “assistive items”?
Methods.
Our dataset consisted of 616 records of low vision patients. From patients’ complaint history, we selected multiple keywords that were related to common activities of daily living. Sentences related to each keyword were converted to numerical data using NLP techniques. Machine learning was then applied to classify the narratives related to each keyword into two categories, labeled based on different “factors of interest” (acuity, contrast sensitivity and attitudes of patients toward certain “assistive items”).
Results.
Using our proposed framework, with patients’ narratives related to specific keywords as input, our model predicted the categories of different factors of interest with promising performance. For example, we found strong associations between patients’ narratives and their acuity or contrast sensitivity for certain activities of daily living (e.g. “drive” in association with acuity and contrast sensitivity).
Conclusions.
Despite our limited dataset, our results show that the proposed framework was able to extract the semantic patterns stored in medical narratives, and to predict patients’ sentiments and quality of vision.
Medical records of patients contain a wealth of information that can provide answers to many research questions or help generate new research directions. Therefore, it is not surprising that many studies have retrospectively analyzed patient records. Analyzing patients’ medical records manually is a tedious task, even for structured data (standardized, clearly defined data, e.g. age, gender). Other information contained in medical records, such as patients’ narratives, may require more advanced techniques to analyze. Medical narratives are records of what patients say during a healthcare examination. As such, they can provide insights into the experiences and challenges faced by patients. Understanding these narratives can help healthcare providers tailor their services to better meet the specific needs of patients and enhance the overall quality of care. Further, these narratives can reveal gaps in current healthcare services, including areas where patients might need more support or resources.

However, analyzing medical narratives has its own shortcomings. First, medical narratives are often subjective and qualitative in nature, and are subject to patients’ own criteria. The choice of wording is also subject to cultural and other personal factors. Second, different patients may use different wordings to express or describe the same thing, thus introducing variability, or “noise”, in the data. For example, when a patient complains about her headache, she may describe it as “intermediate, intense, throbbing and pounding pain”, but the same description by another patient could mean a completely different frequency, intensity and nature of headache; yet another patient suffering from exactly the same frequency, intensity and nature of headache may describe it as “on-and-off, sudden pulsating pain that lasts for a minute or so each time”. Finally, while analyzing the narratives of a handful of patients is feasible, analyzing the narratives in a large number of records (e.g. for quality control of clinics, or for large-scale multi-center clinical trials) would be quite labor intensive, and calls for an automated approach.

To circumvent these problems, in this study, we proposed a framework that combines a Natural Language Processing (NLP) method with machine learning techniques to analyze patients’ narratives from optometric examination records. Machine learning makes use of statistical algorithms to learn from data, creating a model which can then generalize to new datasets. NLP is a subfield of artificial intelligence capable of processing and interpreting human language datasets. There have been previous attempts at applying NLP to analyze eye examination records, but the majority of these studies focused on extracting specific text (e.g. “cataract”,1 “exfoliation syndrome”2) or information (e.g. visual acuity3,4,5) from free-text notes (not from patients’ narratives) for quantitative analysis. A few studies used the extracted information to develop models to predict the prognosis of diseases.6,7 To our knowledge, this is the first time that NLP has been applied to analyzing patients’ narratives.
Specifically, in this study, we tested and validated our approach using narratives extracted from records of low vision patients. Low vision services are oriented toward addressing patients’ goals in their daily living, instead of merely treating the underlying disease or correcting refractive errors, making narratives from low vision examinations highly suitable for NLP analyses. Clearly, there are many ways that we could apply our approach to analyzing patients’ narratives, depending on the specific research questions. Here, we described two examples of applications. The first question we asked was whether there was any association between what patients said in relation to several activities of daily living and the quality of their vision. A strong association would imply that clinicians could get a sense of what to expect of a patient’s vision based simply on what the patient says during history-taking, which could also be very valuable in vision screening. The second question was whether there was any association between what patients said in relation to several activities of daily living and their sentiments toward using certain tools (such as canes) or modifications to their environments (such as better lighting). Because we needed to extract the underlying sentiments from the words that patients used, which are qualitative in nature, we used a type of NLP analysis referred to as “sentiment analysis”: a computational technique for converting unstructured, qualitative data into structured data for quantitative analysis and sentiment identification. In the field of NLP, sentiment analysis is often used to automatically classify text data into multiple categories that gauge the overall opinion, emotion, or sentiment conveyed by the text, and has been applied in various domains, such as customer feedback reviews, social media monitoring, news analysis, and academic research. In contrast to structured data, e.g. numerical data frames organized into columns and rows, text-based narratives are unstructured and non-quantitative. As such, sentiment analysis requires extensive cleaning and preprocessing, including removal of irrelevant content, standardization and normalization, as well as understanding the semantic relationships and inferences within text data to identify the sentiments expressed.8
To recapitulate, the primary goal of this study was to show the feasibility of using a framework that combines NLP and machine learning to analyze narratives in patients’ medical records. The secondary goal was to address two specific research questions that we used to demonstrate how the framework could be applied.
METHODS
Our general approach was as follows: we extracted keywords that were informative for our specific research questions and used NLP to convert patients’ narratives related to each of the keywords into structured data via pre-processing, feature extraction, and transformation (corresponding to the three sub-sections under the section NLP Processing, respectively). These keywords can be any words that are indicative of the topics and questions of interest to researchers, such as cook, read, walk, drive, TV, shopping, cinema, etc. Then, depending on the specific research question, we identified the factors of interest and converted them into category labels. These factors of interest should be easily categorized, such as patients’ sentiments toward specific tools (cane, magnifier, telescope, etc.), which can be manually decided and labeled, or scalable vision measurements (visual acuity, contrast sensitivity, stereopsis, etc.). In our case, the category labels were dichotomous. Subsequently, we fed the structured data and the category labels to a “classifier” and applied machine learning so that the classifier could learn any association between the keyword-related text data and the category labels. After learning, we tested how well the model could predict the category label for an unseen set of structured data (i.e. data that were not used for training), thus determining the generalizability of the model. A schematic illustration of these processes is shown in Figure 1.
Figure 1.

Schematic of the framework in this study.
The rest of this section is arranged as follows: First, in the sub-section General Description of Dataset, we describe the relevant information we used from the original medical records. Then, in the sub-section NLP Processing, we detail the NLP procedures that were applied to text related to the chosen keywords. Last, based on our aims and the questions to be addressed, we conducted machine learning in three parts: Model Selection, Question 1, and Question 2. In the final sub-section, Machine Learning, we describe the particular machine learning methods for each part, including the keyword selection, category labeling, and models.
General Description of Dataset
Data used in this study were extracted from the electronic medical records of all patients (N=616, 412 females) attending the Low Vision Clinic of the Meredith Morgan University Eye Center at the University of California, Berkeley, during the five-year period between November 1, 2017 and November 1, 2022. Over this period of time, a minimum of seven clinicians were involved in providing patient care in the Low Vision Clinic. Because only deidentified data were used, this research was deemed to meet the definition of non-human subject research by the Institutional Review Board of the University of California, Berkeley.
Data extraction and removal of all protected health information were performed by a clinic staff member who was not part of our research team, before we were allowed access to the dataset. Our limited de-identified dataset contained the following information: age, gender, complaint history and assessment results. The age and gender distributions of all these patients are shown in Figure 2, and a sample of the assessment results in this dataset is provided in Figure 3.
Figure 2.

Age and gender distribution of patients in our dataset. Age distribution for female (blue) and male (orange) patients. Gray areas indicate where two colors overlap.
Figure 3.

A sample of assessment results.
NLP Processing
Text processing and feature extraction procedures applied to our dataset were accomplished using the Python library: Natural Language Toolkit (NLTK).9 Machine learning procedures that were applied subsequently made use of the Python library: scikit-learn (sklearn).10
Text Preprocessing
We first extracted text from the field containing patients’ complaint history. In our Low Vision Clinic, history-taking was performed organically and clinicians asked open-ended questions such that patients could answer in their own words. However, there were questions that were always covered, including the goals of the patients, their living situation, what they perceived of their vision, whether they saw better at night or during the day, whether they still drove (and if so, the frequency and distance), whether they still worked, hobbies, how they typically spent their day, etc. Extracted text was pre-processed using the following procedures. Each record of narrative was first converted to lowercase and then split into sentences using punctuation marks that signify the end of a sentence, such as periods (.), question marks (?), and exclamation points (!). We then applied word tokenization to split sentences into individual words, while numbers, punctuation, and irrelevant characters were removed. Subsequently, we identified the “stop words” (e.g., ‘the’, ‘is’, ‘and’, ‘not’), which are common in the document but contain little value in conveying the meaning of what patients said. Although there is no universally accepted list of stop words, we used a predefined list provided by NLTK to identify these stop words in the text data. At this step, these words were not removed from the original sentences because some stop words still carry important sentiment information, such as “not”, “don’t”, etc. A word tag was paired with each non-stop word in one of four categories: adjective, adverb, verb, or noun, based on the word’s position within its context. Finally, both stop and non-stop words were lemmatized, i.e. reduced to their root forms (“lemmas”), using the “WordNetLemmatizer” function from NLTK. An example illustrating how sentences in the original record were transformed during these procedures is shown in Figure 4, and a minimal code sketch of these steps follows Figure 4.
Figure 4.

An example of text preprocessing procedures. For the word tags, we used “j”, “r”, “v”, “n”, and “s” to denote adjective, adverb, verb, noun, and stop word, respectively.
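To make these steps concrete, the following is a minimal sketch of the preprocessing pipeline using NLTK. The function names and the sample sentence are our own illustrative choices, not the code used in the study.

```python
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the NLTK resources used below:
# nltk.download(["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"])

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def penn_to_wordnet(tag):
    """Map Penn Treebank POS tags onto the four categories used here."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("R"):
        return wordnet.ADV
    if tag.startswith("V"):
        return wordnet.VERB
    return wordnet.NOUN

def preprocess_record(text):
    """Lowercase, sentence-split, tokenize, tag, and lemmatize a narrative."""
    sentences = []
    for sentence in sent_tokenize(text.lower()):
        # Keep alphabetic tokens only; numbers and punctuation are dropped.
        tokens = [t for t in word_tokenize(sentence) if t.isalpha()]
        tagged = []
        for word, tag in nltk.pos_tag(tokens):
            if word in STOP_WORDS:
                # Stop words are kept (tag "s") because some, e.g. "not",
                # still carry sentiment information.
                tagged.append((LEMMATIZER.lemmatize(word), "s"))
            else:
                wn_tag = penn_to_wordnet(tag)
                # Label adjectives "j" (as in Figure 4); others map directly.
                label = "j" if wn_tag == wordnet.ADJ else wn_tag
                tagged.append((LEMMATIZER.lemmatize(word, wn_tag), label))
        sentences.append(tagged)
    return sentences

print(preprocess_record("She does not drive at night. Reading is difficult!"))
```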
Keyword-related Feature Extraction
In general, “keywords” can be any words that are of interest to an investigator. Table 1 lists the keywords that we utilized for each section of this study. Overall, we selected nine keywords: “cook”, “read”, “walk”, “drive”, “see”, “watch”, “TV”, “cane” and “light”, based on the questions we posed and the frequency of appearance of these words in our dataset. To ensure that we had enough data affiliated with each keyword for model training and testing, a keyword was only chosen if it yielded at least 125 records. This threshold of 125 records was determined based on our model’s complexity and the results of our preliminary tests. For each chosen keyword, sentences containing the keyword were extracted from the pre-processed text data from all patients. After this step, only these sentences were preserved, while all others were considered irrelevant and excluded from subsequent analysis. Within the extracted data, all unique non-stop words were identified and the frequency of occurrence of each of these words was counted. Because the diversity in how individuals express comparable sentiments leads to exceedingly high dimensionality and redundancy in the narratives, we used the sets of synonyms (the “synsets” function) provided by NLTK to reduce redundancy in the dataset by replacing words in the text with a common representative from their respective “synsets”. In practice, we used a subset of words from the unique-word collection that had the highest frequencies to serve as the “representative words”. Using this method, non-stop words in the sentences were replaced by the representative words that had the highest path similarity to them in the synonym domain, as calculated by NLTK functions. For example, in the analysis using “walk” as the keyword, “step” was replaced by “stair”, “trouble” was replaced by “difficulty”, and “dog” was replaced by “support”. The “synsets” function provided by NLTK is not perfectly generalizable, so these representative words might not reflect the best meaning of the corresponding original words, which could introduce additional noise into the data. This tradeoff between the degree of redundancy and the noise from “synsets” can be tuned by adjusting the number of representative words used in the analysis. Finally, word tokens after replacement were converted to a numerical matrix using a “count vectorizer” approach. This technique created a “vocabulary” of all the unique words in the text corpus. At this point, all patients’ complaints were represented as a matrix, with each row corresponding to a unique patient and each column corresponding to a word in the vocabulary. The size of this matrix was thus Length of Records × Length of Vocabulary. The value in each cell of the matrix was the number of times a word appeared in a patient’s complaint field. Because the frequency of a word might not reflect any scaled value, we normalized the matrix into a binary one by replacing all non-zero values with 1. A minimal sketch of these synonym-replacement and vectorization steps is shown after Table 1.
Table 1.
Keywords, classifiers and category labels used for each section of the study.
| Study Sections | Keywords | Classifiers | Category Labels |
|---|---|---|---|
| Model Selection | “cane”, “light” | Naïve Bayesian Classifier, Support Vector Machine, and Neural Network | Manual labels based on patients’ attitudes towards using cane and extra lighting |
| Question 1 | “cook”, “read”, “walk”, “drive”, “see”, “watch”, “TV”, “cane” and “light” | Neural Network | Labels based on patients’ visual acuities and contrast sensitivities |
| Question 2 | “cook”, “read”, “walk”, “drive”, “see”, “watch”, and “TV” | Neural Network | Manual labels based on patients’ attitudes towards using cane and extra lighting |
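As an illustration of the synonym-replacement and vectorization steps described above, the sketch below uses NLTK’s WordNet synsets and scikit-learn’s CountVectorizer; the example sentences are hypothetical.

```python
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import CountVectorizer

# Path similarity between two candidate words in the WordNet synonym graph;
# a low-frequency word would be replaced by the representative word with
# the highest similarity.
print(wn.synsets("step")[0].path_similarity(wn.synsets("stair")[0]))

# Hypothetical keyword-related sentences (lemmatized, synonyms collapsed),
# one string per patient record containing the keyword "walk".
docs = [
    "walk difficulty stair night",
    "walk support outside not alone",
    "walk fine day difficulty stair",
]

# binary=True replaces raw counts with 0/1, matching the normalization in
# which word frequency is treated as presence/absence.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)  # Length of Records x Length of Vocabulary

print(vectorizer.get_feature_names_out())
print(X.toarray())
```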
Feature Transformation: Principal Component Analysis (PCA)
As explained in the previous section, converting text data into a numerical matrix could result in a matrix with an exceedingly high-dimensional space and high redundancy, due to intercorrelations among words and the presence of low-information stop words. To reduce the dimensionality of the data while preserving as much variance as possible, we applied Principal Component Analysis (PCA). Only the leading principal components that together explained at least 80% of the variance in each keyword-based dataset were retained.
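A minimal sketch of this step with scikit-learn (the binary matrix here is a random stand-in): passing a float to n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in binary word-presence matrix (records x vocabulary).
rng = np.random.default_rng(0)
X_binary = rng.integers(0, 2, size=(200, 150))

# Keep the fewest components whose cumulative explained variance >= 80%.
pca = PCA(n_components=0.80, svd_solver="full")
X_reduced = pca.fit_transform(X_binary)

print(X_reduced.shape)                         # (200, selected components)
print(pca.explained_variance_ratio_.cumsum())  # cumulative explained variance
```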
Machine Learning
An integral part of our framework as shown in Figure 1 is the classifier. Because there was no a priori reason to select a specific classifier, we first compared the performances of three popular classifiers for NLP tasks with relatively small data size: Naïve Bayesian Classifier, Support Vector Machine, and Neural Network, and then selected the one with the best performance as the classifier of choice for addressing our two research questions. A summary of the study can be found in Table 1.
1. Model Selection:
For the purpose of model selection, we used “cane” and “light” as the keywords. We chose these two keywords for the model selection because manual determination of sentiments was required at this stage and, in this dataset, categorizations of patients’ sentiments related to these two topics were more clear-cut than for other keywords, and also because they are highly related to some activities of daily living, especially mobility. We extracted text from patients’ complaint history that was related to these two keywords and manually generated category labels to represent patients’ sentiments toward these keywords from the same set of patient records (see Table 1). Then we used the classifiers to determine how well the structured data converted from the keyword-related text predicted the category labels. Given that the same keywords were used, and the information was extracted from the same data field, we expected to observe high model performance. This was used to validate our method. The classifier yielding the highest prediction performance would be used for the rest of the study.
Because patients’ sentiments in a single narrative could consist of mixtures of multiple emotions, such as hope and fear, or relief and concern, and thus are difficult to categorize, we manually labeled patients’ narratives in relation to “cane” and “light” as either “positive” or “not-positive”. Two research assistants labeled the patients’ narratives independently; disagreements were discussed and resolved after consultation with one of the authors. To be labeled as “positive”, the narrative had to indicate that the patient used the item more than just occasionally and had no negative comments about it. All other situations were labeled as “not-positive”.
We tested three types of models: Naïve Bayesian Classifier (NBC), Support Vector Machine (SVM), and Neural Network (NN) to classify the processed text data related to “cane” and “light”, so that we could test whether the sentiment patterns could be extracted successfully. The parameters we used for these models were as follows: For NBC, the likelihood of features was assumed to follow a Gaussian distribution. The SVM model in this study used a radial basis function (RBF) kernel. The neural network had one hidden layer containing 80 neurons activated by the ReLU function. This neural network model used an “adam” optimizer during training, with a numerical-stability constant (epsilon) of 1e-8 and a learning rate of 0.001. It was cross-validated with a random 10% of our training dataset and trained with at most 1000 iterations.
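These three classifiers map onto scikit-learn as sketched below. Synthetic data stand in for the PCA features, and the RBF kernel coefficient is left at the library default since no value is specified above.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-reduced features and binary sentiment labels.
X_train, y_train = make_classification(n_samples=150, n_features=35,
                                       random_state=0)

models = {
    "NBC": GaussianNB(),          # Gaussian likelihood for the features
    "SVM": SVC(kernel="rbf"),     # RBF kernel, default kernel coefficient
    "NN": MLPClassifier(
        hidden_layer_sizes=(80,), # one hidden layer, 80 units
        activation="relu",
        solver="adam",
        epsilon=1e-8,             # Adam numerical-stability constant
        learning_rate_init=0.001,
        early_stopping=True,      # validate on a held-out split
        validation_fraction=0.1,  # random 10% of the training data
        max_iter=1000,
        random_state=0,
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_train, y_train))
```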
2. Question #1:
Was there any association between what patients said in relation to several activities of daily living and the quality of their vision?
To address this question, the keywords we chose were all related to common activities of daily living: “cook”, “read”, “walk”, “drive”, “see”, “watch”, “TV”, “cane” and “light”. For the factors of interest, we chose visual acuity and contrast sensitivity as measures of the levels of vision.
Here, we categorized patients’ vision based on their visual acuities (both the better- and the worse-eye acuity in logMAR) and contrast sensitivity (only measured binocularly, in log units). At the Low Vision Clinic of the Meredith Morgan University Eye Center at the University of California, Berkeley, visual acuities were typically measured using a back-lit Bailey Lovie Visual Acuity Chart.11 In the event that patients could not read the letters on the top row of the chart, clinicians would switch to using the Berkeley Rudimentary Vision Test,12 which allowed quantitative measurements of acuity down to 20/16,000. Contrast sensitivity was usually measured using the Mars Contrast Sensitivity Test,13 but occasionally (for example, when patients were non-verbal or did not know the Roman alphabet), other contrast sensitivity tests such as the Berkeley Discs Contrast Sensitivity Test (Precision Vision, Woodstock, IL) or the Melbourne Edge Test14 might be used. Although in this Low Vision Clinic, binocular acuity and contrast sensitivity were also typically measured using a pair of filters that transmitted only 4% of light, these measurements were not analyzed in this study. All the reported acuity and contrast sensitivity measurements were obtained under typical room lighting. Figure 5 shows the distributions of these three measurements. Given our limited dataset, current methods cannot reliably regress data extracted from text onto a continuous numerical domain; consequently, we used a classification task. Classifier models are only capable of coarsely categorizing input into a finite number of groups. Here, we categorized patients’ vision in relation to acuity or contrast sensitivity into two groups, even though both acuity and contrast sensitivity measurements spanned a range of values. With sufficient data and a more complex model, finer resolution in categorization could be achieved. Categorization was accomplished based on a threshold value. For example, we could categorize patients as having contrast sensitivity above or below a threshold of 1 log unit, or an acuity better or worse than 1.0 logMAR. Multiple threshold values were tested for each measurement. Considering that too small or too large a threshold would lead to an imbalance in sample size between the two categories, we used 0.5, 0.75, and 1 logMAR as the thresholds for the better-eye acuity; 0.75, 1, and 1.25 logMAR for the worse-eye acuity; and 0.75, 1, and 1.25 log units for contrast sensitivity. A short sketch of this dichotomization follows Figure 5.
Figure 5.

Distribution of visual acuities and contrast sensitivities among patients. Left: Acuity of the better (blue) and worse (orange) eyes. Gray areas indicate where the two colors overlap. Right: Binocular contrast sensitivity.
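The dichotomization described above is a simple thresholding operation; a sketch with hypothetical acuity values:

```python
import numpy as np

# Hypothetical better-eye acuities (logMAR); a label of 1 means acuity worse
# than (or equal to) the chosen threshold, 0 means better.
acuity = np.array([0.3, 1.2, 0.9, 1.6, 0.5])

for threshold in (0.5, 0.75, 1.0):  # thresholds used for the better eye
    labels = (acuity >= threshold).astype(int)
    print(threshold, labels)
```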
3. Question #2:
Was there any association between what patients said in relation to several activities of daily living and their sentiments toward using certain tools or modifications in their environments?
The keywords used here were: “cook”, “read”, “walk”, “drive”, “see”, “watch”, and “TV”. For the factors of interest, we chose to examine patients’ sentiments toward two items: “cane” and “light”. The rationale was that orientation and mobility training (hence, cane usage) and lighting (some patients, such as those with age-related macular degeneration, usually benefit from better lighting, while those with photophobia may prefer dimmer illumination) were topics that were usually brought up in low vision rehabilitation. The methods for categorizing patients’ sentiments toward “cane” and “light” were identical to those described in the Model Selection section.
RESULTS
Keywords Query
For the purpose of Model Selection and to address our questions, as detailed in Table 1, we used a total of nine keywords in this study: “cook”, “read”, “walk”, “drive”, “see”, “watch”, “TV”, “light”, and “cane”. Because many patients’ narratives did not contain all of these keywords, the size of the dataset for sentiment analysis varied across keywords. Based on our preliminary analysis, we used 30 representative words to represent what patients mentioned with respect to each chosen keyword. Table 2 shows the number of unique patient records containing each of these keywords (Length of Records), and the number of all unique words used by all patients in their complaints related to the corresponding keyword (Length of Vocabulary) for each keyword-based matrix.
Table 2:
Parameters for Keywords.
| Keywords | cook | read | walk | drive | see | watch | TV | light | cane |
|---|---|---|---|---|---|---|---|---|---|
| Length of Records | 228 | 496 | 198 | 316 | 456 | 142 | 232 | 289 | 192 |
| Length of Vocabulary | 144 | 209 | 154 | 165 | 452 | 121 | 139 | 158 | 151 |
| Principal Components | 33 | 35 | 34 | 35 | 50 | 28 | 30 | 30 | 35 |
| Explained Variance (%) | 82 | 81 | 82 | 81 | 82 | 80 | 81 | 81 | 82 |
Principal Component Analysis
PCA was performed on the dataset returned from each keyword-based query and processing. Figure 6 shows the relationship between the cumulative percentage of explained variance and the number of principal components associated with each keyword. We selected the number of principal components that explained approximately 80% of the variance, as detailed in Table 2. Although the length of vocabulary varied across keywords, the number of principal components needed to explain approximately 80% of the variance was similar. This resulted in a numerical matrix for each keyword with two dimensions: one equal to the number of records, and the other equal to the number of selected principal components. These data were used as the numerical features input to the classifier.
Figure 6.

Cumulative percentage of explained variance versus the number of principal components associated with each keyword used in this study. The x-axis is truncated at 300 components.
Classification Performance
Due to the limited size of our dataset, performance based on a one-time data partition for training and testing might not be generalizable. Thus, for each classification task, we randomly partitioned the dataset generated for each keyword into a training set (80% of the dataset) and a testing set (the remaining 20%) 20 times with different randomization sequences, essentially simulating 20 sets of data. Each simulation yielded an F1 score (the harmonic mean of precision and recall; it ranges between 0 and 1, with higher values indicating better performance) for each of the training and testing phases. We used the F1 score rather than accuracy to evaluate the performance of our models because the F1 score is more robust than accuracy when the numbers of samples in the different categories are unequal. The F1 scores for the 20 simulations of the same condition and phase (training/testing) were averaged and used to evaluate the performance of our model.
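A sketch of this evaluation loop, again with synthetic stand-in data; the neural network settings follow those described under Model Selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for one keyword's PCA features and category labels.
X, y = make_classification(n_samples=300, n_features=35, random_state=0)

test_f1 = []
for seed in range(20):  # 20 random 80/20 partitions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = MLPClassifier(hidden_layer_sizes=(80,), max_iter=1000,
                          early_stopping=True, random_state=seed)
    model.fit(X_tr, y_tr)
    test_f1.append(f1_score(y_te, model.predict(X_te)))

print(f"mean (std) testing F1: {np.mean(test_f1):.2f} ({np.std(test_f1):.2f})")
```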
Model Selection
Performances of the three classification models were quite comparable, but the neural network model outperformed SVM and NBC in the classification task for “cane”, with a mean (std) testing F1 of 0.8 (0.05) (SVM: 0.73 (0.09); NBC: 0.73 (0.06)). As for the classification for “light”, NN and SVM had comparable performance (mean (std) testing F1: 0.74 (0.06) and 0.73 (0.07), respectively) and outperformed NBC (mean (std) testing F1: 0.66 (0.05)). Based on these results, for subsequent analyses, we only used NN as the classifier to answer our two research questions.
Question #1:
Was there any association between what patients said in relation to several activities of daily living and their levels of vision?
In this analysis, we used only the neural network to investigate whether patients’ keyword-related complaints contained semantic patterns that could predict the quality of their vision. Performances of the classification tasks based on each of the nine keywords are shown in Figure 7, for three different thresholds (for categorizing patients into two groups) for each vision-based measurement. These results show that narratives related to different keywords differed in their performance in predicting both the visual acuity and the contrast sensitivity categories of patients. In addition, for the narratives related to each keyword, the ability to predict visual acuity and contrast sensitivity categories depended on the thresholds used to divide these vision measurements into two categories. These findings suggest that, for each vision measurement, there are different boundaries in the feature space for narratives related to different keywords. For example, for the task of predicting contrast sensitivity categories using a threshold of 0.75 log units, patients’ narratives in relation to the keyword “drive” outperformed the narratives related to “see”, indicating that the semantic features of “drive”-related narratives were more separable between patients whose contrast sensitivities were above and below 0.75 log units. However, when the threshold was 1.25 log units, “see”-related narratives outperformed those related to “drive”, showing that between patients whose contrast sensitivities were above and below 1.25 log units, features extracted from “see”-related narratives were more segregated than those from “drive”-related narratives. Across all the results, we found six classification tasks with a mean testing F1 score above 0.7. When patients’ narratives related to “drive”, “light”, and “read” were used to predict whether the acuity of the better eye was above or below 1 logMAR (20/200 Snellen, or the definition of legal blindness), the mean testing F1 scores (std) for these three keywords were 0.81 (0.05), 0.76 (0.05), and 0.73 (0.05), respectively. Also, with patients’ narratives related to “drive” as the input, the neural network was able to predict whether patients’ contrast sensitivity was above 0.75 log units, with a mean testing F1 score (std) of 0.74 (0.05). Finally, when patients’ narratives related to “cane” and “TV” were used to predict whether their contrast sensitivity was above or below 1.25 log units, the resulting mean testing F1 scores (std) were 0.73 (0.06) and 0.7 (0.07) for the two keywords, respectively. Note that our model showed poor performance in many tasks, especially when the threshold for the worse-eye acuity or the contrast sensitivity was the intermediate one of the three values tested. This might imply that sentiments in patients’ complaints about their daily activities were more articulate and less variable for patients with high or low vision than for those with intermediate levels of vision. It could also be due to the relatively small sample size, which limited the model complexity and training capacity, such that the more complex patterns needed to categorize patients with intermediate vision conditions were not well extracted and learned.
Figure 7.

Testing F1 scores for classification tasks using labels based on vision measurements and the neural network. Lines indicate the mean testing F1 score with respect to the chosen threshold for categorization (x-axis), and shading indicates the 95% confidence interval.
Question #2:
Was there any association between what patients said in relation to several activities of daily living and their sentiments toward “cane” and “light”?
As shown in Table 3, we found that patients’ narratives associated with the seven keywords were capable of predicting their sentiments toward cane usage and lighting better than chance level (50%). Among all these results, we found two classification tasks with high testing F1 scores (≥0.70): “walk”- and “drive”-related narratives performed well in predicting sentiments toward cane usage, with mean F1 scores (std) of 0.77 (0.04) and 0.75 (0.05), respectively.
Table 3.
Results of Mean Testing F1 (std) for Question #2.
| Keywords | cook | read | walk | drive | see | watch | TV |
|---|---|---|---|---|---|---|---|
| Testing F1 for Cane-based Labeling | 0.57 (0.06) | 0.66 (0.06) | 0.77 (0.04) | 0.75 (0.05) | 0.67 (0.05) | 0.63 (0.09) | 0.65 (0.05) |
| Testing F1 for Light-based Labeling | 0.65 (0.06) | 0.62 (0.06) | 0.61 (0.08) | 0.59 (0.05) | 0.65 (0.05) | 0.64 (0.06) | 0.65 (0.05) |
DISCUSSION
The primary goal of this study was to demonstrate the feasibility of applying a framework that combined NLP and machine learning to analyze narratives in patients’ medical records. To test and validate our framework, we applied it to analyze a set of 616 unique records of low vision patients. Despite the rather small number of records and the difficulties of analyzing unstructured, qualitative data, we showed that (1) our framework was able to successfully extract semantic and sentiment patterns from low vision patients’ narratives; (2) the neural network exhibited strong performance (high testing F1 scores) when analyzing narratives that involved multiple keywords in the classification tasks; and (3) the framework was effective (based on the high F1 scores from our neural network) in predicting various factors of interest, including patients’ vision measurements (we used acuity and contrast sensitivity as examples) and sentiments toward certain items (we used “cane” and “light” as examples). It should be pointed out that our dataset was extracted from clinic records which, in addition to the “noise” from patients using different wordings to express their emotions or sentiments toward using tools, contained other sources of noise, including the fact that at least seven clinicians were involved in providing care to low vision patients during the study period, and that different charts were used to measure acuity and contrast sensitivity. The fact that we were still able to demonstrate the feasibility of the framework speaks to its robustness.
Are the results from our framework reasonable? Table 3 shows that the testing F1 score for the neural network in predicting patients’ sentiments toward “cane” based on their narratives related to the keyword “cook” was 0.57, a performance barely above chance. This implies that “cane” and “cook” are unlikely to be related to each other, which seems reasonable in real life. In contrast, the F1 score between “cane” and “walk” was 0.77, implying a strong association between the two, which also seems reasonable, given that many low vision patients use canes while walking. Similarly, we could also ask whether those F1 scores that suggest a strong association between an activity of daily living and a vision measurement are reasonable. For example, Figure 7 shows that narratives related to the keyword “drive” could predict whether patients’ contrast sensitivity was above or below 0.75 log units (F1 score = 0.74). This result also seems reasonable in real life because contrast sensitivity impairment is known to be associated with a recent history of crash involvement,15,16 higher collision rates17 or on-road driving performance problems.18,19,20,21 In Figure 7, we also observe a very high F1 score between “drive” and the better-eye acuity (for a threshold of 1.0 logMAR). These results are not inconsistent with reports in the literature that contrast sensitivity is a better predictor of the driving ability of low vision patients than visual acuity.16,17,20,21 Our results simply showed that patients’ narratives could be used to predict whether a low vision patient’s acuity was “good” or “bad” based on an acuity threshold of 1.0 logMAR (20/200 Snellen, or the definition of legal blindness) or a contrast sensitivity threshold of 0.75 log units (severe loss of contrast sensitivity22).
To test and validate our framework, and to demonstrate how to apply the framework to addressing some research questions, we identified two research questions in this study. We first looked for any association between low vision patients’ narratives in relation to several activities of daily living and the quality of their vision. As examples, we used visual acuity and contrast sensitivity as measurements of the quality of vision. Our second question dealt with low vision patients’ sentiments toward some items that they might use in their daily living (we used “cane” and “light” as examples). Clearly, our choices of these two items might be specific to low vision patients. However, one may see how our proposed framework can be easily adopted to answer research questions in other disciplines of optometry and ophthalmology. For example, intra-ocular pressure, a quantitative measurement; and patients’ sentiments toward some glaucoma medications (based on what patients say) could be used as “factors of interest” in glaucoma research. Similarly, accommodative responses and refractive errors could be used as quantitative “factors of interest” in myopia development research, and children’s sentiments toward glasses or certain type of interventions (e.g. atropine) could be extracted to address questions in this area of research.
An important insight from our findings is that the best performance of our model when classifying patients’ narratives related to various keywords was observed at different threshold levels for different vision measurements. This suggests that a cooperative approach, one that incorporates narratives associated with multiple keywords, could enhance the ability of the model in predicting patients’ visual characteristics at a more refined level. However, this would require a much larger dataset with enough patients’ narratives that relate to the various chosen keywords.
A recent study reported that the visual acuity of patients could be estimated from their reported abilities in seeing certain objects through a questionnaire approach (Wu et al. IOVS 2022;63:ARVO E-Abstract 4265). That study used a structured approach in which patients were asked pre-planned questions and, based on their responses, an estimation of their acuity was made. The approach was drastically different from the study reported here because our data were completely unstructured and thus more variable. Our dataset also contained more information; for example, it allowed us to perform sentiment analysis to extract the underlying sentiments of patients based on their own words. Nevertheless, one can easily envision that the questionnaire approach to estimating acuity could be incorporated in the clinic or in future research studies, which might improve the estimation of patients’ visual capabilities.
Another interesting finding of this study is that patients’ narratives associated with different keywords yielded different prediction performances. These findings have implications for the development of questionnaires in clinical settings, as such analyses can assist healthcare providers in formulating targeted questions to assess patients for different health conditions. Other applications of our proposed framework include analyzing online surveys concerning patients’ disabilities and living situations, designing clinical trials based on the associations between specific daily activities and visual conditions, evaluation of clinical care delivery (for quality control purpose) and assessing or predicting disease prognosis (by comparing a patient’s data over time).
A major limitation of our study was the size of our dataset, which precluded us from addressing some other interesting questions. For instance, we could not examine patients’ sentiments toward popular assistive devices such as magnifiers and digital products, because of the insufficient number of records (fewer than 100) reporting usage of these devices. Another limitation is that our categorization approach, which encompassed manual sentiment identification and vision measurement segmentation, relied on simple criteria. Sentiments were labeled as either “positive” or “not-positive” and we only tested a limited number of thresholds for vision measurements. Achieving finer classification would necessitate more complex models, and effective training for intricate neural networks such as deep learning would require a much larger dataset. Further, the patient records that we used were not designed for automated data extraction and modeling. Future prospective studies that plan on extracting information from medical records (or records designed for clinical trials) may need to first design better record formats to facilitate the extraction of relevant information and the application of NLP.
Despite these limitations, we have demonstrated how a framework combining NLP and machine learning could be applied to analyzing eye examination records to address specific research questions. Future studies ensuring a sufficiently large number of patient records containing narratives and information about their specific research questions should find this framework useful in analyzing patients’ data.
ACKNOWLEDGMENTS
We thank Tulsi J. Prabhakaran and Meisen Wang for their assistance in data labeling and preliminary analyses.
REFERENCES
- 1. Peissig PL, Rasmussen LV, Berg RL, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc 2012;19:225–34.
- 2. Stein JD, Rahman M, Andrews C, et al. Evaluation of an algorithm for identifying ocular conditions in electronic health record data. JAMA Ophthalmol 2019;137:491–7.
- 3. Smith DH, Johnson ES, Russell A, et al. Lower visual acuity predicts worse utility values among patients with type 2 diabetes. Qual Life Res 2008;17:1277–84.
- 4. Mbagwu M, French DD, Gill M, et al. Creation of an accurate algorithm to detect Snellen best documented visual acuity from ophthalmology electronic health record notes. JMIR Med Inform 2016;4:e4732.
- 5. Baughman DM, Su GL, Tsui I, et al. Validation of the Total Visual Acuity Extraction Algorithm (TOVA) for automated extraction of visual acuity data from free text, unstructured clinical records. Transl Vis Sci Technol 2017;6:2.
- 6. Gui H, Tseng B, Hu W, et al. Looking for low vision: Predicting visual prognosis by fusing structured and free-text data from electronic health records. Int J Med Inform 2022;159:104678.
- 7. Wang SY, Tseng B, Hernandez-Boussard T. Deep learning approaches for predicting glaucoma progression using electronic health records and natural language processing. Ophthalmol Sci 2022;2:100127.
- 8. Zhang L, Wang S, Liu B. Deep learning for sentiment analysis: A survey. Wires Data Min Knowl 2018;8:e1253.
- 9. Bird S. NLTK: The Natural Language Toolkit. In: Curran J, ed. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia, July 2006:69–72. Available at: https://aclanthology.org/P06-4018.pdf. Accessed May 4, 2024.
- 10. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res 2011;12:2825–30.
- 11. Bailey IL, Lovie JE. New design principles for visual acuity letter charts. Am J Optom Physiol Opt 1976;53:740–5.
- 12. Bailey IL, Jackson AJ, Minto H, et al. The Berkeley Rudimentary Vision Test (BRVT). Optom Vis Sci 2012;89:1257–64.
- 13. Arditi A. Improving the design of the letter contrast sensitivity test. Invest Ophthalmol Vis Sci 2005;46:2225–9.
- 14. Verbaken JH, Johnston AW. Population norms for edge contrast sensitivity. Am J Optom Physiol Opt 1986;63:724–32.
- 15. Ball K, Owsley C, Sloane ME, et al. Visual-attention problems as a predictor of vehicle crashes in older drivers. Invest Ophthalmol Vis Sci 1993;34:3110–23.
- 16. Green KA, McGwin G Jr., Owsley C. Associations between visual, hearing, and dual sensory impairments and history of motor vehicle collision involvement of older drivers. J Am Geriatr Soc 2013;61:252–7.
- 17. Owsley C, Stalvey BT, Wells J, et al. Visual risk factors for crash involvement in older drivers with cataract. Arch Ophthalmol 2001;119:881–7.
- 18. Wood JM, Carberry TP. Older drivers and cataracts: measures of driving performance before and after cataract surgery. Transport Res Rec 2004;1865:7–13.
- 19. Worringham CJ, Wood JM, Kerr GK, et al. Predictors of driving assessment outcome in Parkinson’s disease. Mov Disord 2006;21:230–5.
- 20. Bowers AR, Anastasio RJ, Sheldon SS, et al. Can we improve clinical prediction of at-risk older drivers? Accident Anal Prev 2013;59:537–47.
- 21. Anstey KJ, Horswill MS, Wood JM, et al. The role of cognitive and visual abilities as predictors in the multifactorial model of driving safety. Accident Anal Prev 2012;45:766–74.
- 22. Faye EE. Contrast sensitivity tests in predicting visual function. Int Congress Ser 2005;1282:521–4.
