Abstract
High-throughput phenotyping, the automated mapping of patient signs and symptoms to standardized ontology concepts, is essential for realizing value from electronic health records (EHR) in support of precision medicine. Despite technological advances, high-throughput phenotyping remains a challenge. This study compares three computational approaches to high-throughput phenotyping: a large language model (LLM) incorporating generative AI, a deep learning (DL) approach utilizing span categorization, and a machine learning (ML) approach with word embeddings. The LLM approach, implemented with GPT-4, demonstrated superior performance, suggesting that large language models are poised to become the preferred method for high-throughput phenotyping of physician notes.
Introduction
The advent of precision medicine has intensified the need for high-throughput phenotyping of electronic health records (EHR). This task remains challenging due to the complexity and volume of physician notes. High-throughput phenotyping, the automated mapping of patient symptoms to standardized ontology concepts, is crucial to this endeavor [1–5]. Although machine learning and deep learning methods have made progress toward this goal, their limitations highlight the need for more efficient approaches. The emergence of large language models (LLMs) such as GPT-4 offers a promising new way to address this challenging problem. This study compares the performance of LLM, deep learning, and machine learning approaches to the high-throughput phenotyping of physician notes.
The precision medicine initiative, which aims to match treatments and outcomes with the individual characteristics of each patient, requires computable descriptions of patient signs and symptoms. These descriptions must be both detailed and automated. Despite this critical need for high-throughput methods, their implementation in human medicine has lagged behind other fields, such as agriculture [6, 7]. There is a pressing need for more efficient high-throughput methods [8–10].
Historically, natural language processing (NLP) methods for extracting medical concepts from medical text evolved from rule-based and dictionary-based systems [11–13]. Second-generation systems used machine learning and statistical models to find medical concepts in text [14, 15]. The third generation of concept-extraction approaches applied deep learning methods to this problem, notably recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [16–20]. Fourth-generation systems introduced the transformer architecture and BERT (bidirectional encoder representations from transformers), achieving performance gains through improvements in attention and language understanding [21–26]. The emergence of fifth-generation large language models (LLMs), such as GPT-4 (a generative pre-trained transformer), offers new flexibility, scalability, and generalizability, making previously intractable NLP problems approachable, including the high-throughput phenotyping of physician notes [27–29].
Building on a prior pilot study [30], which demonstrated the potential of GPT-4 for high-throughput phenotyping, this study compares three computational approaches on a larger corpus of physician notes:
an LLM approach: GPT-4, which combines a large language model with generative AI capabilities;
a machine learning (ML) approach: NimbleMiner, which combines a machine learning classifier with word vector embeddings; and
a deep learning (DL) approach: spaCy spancat, which depends on a convolutional neural network and word tokenization for span categorization.
This comparison evaluates the performance of these approaches for high-throughput phenotyping, providing information on their suitability for precision medicine. To establish a ground-truth data set, we manually annotated the signs and symptoms (phenotype) of the patients described in 170 physician notes. We evaluated the accuracy, precision, and recall of the three approaches. The results show the potential for advanced computational approaches to perform high-throughput phenotyping of EHRs and foreshadow a shift towards the dominance of approaches based on large language models.
Box 1:
Instructions to GPT-4 and the human annotator for finding neurological phenotypes in physician notes
Methods
Data Acquisition: We analyzed physician notes from electronic health records (EHR) of neurology patients diagnosed with multiple sclerosis (MS, ICD-10 code: G35) who visited the University of Illinois at Chicago Neurology Clinic between 2019 and 2022. The physician notes were extracted from the REDCap (Research Electronic Data Capture) system as a CSV file. To avoid analyzing notes with little substantive content, we selected the first physician progress note for each patient that contained at least 600 words. We excluded discharge summaries, admission notes, and consultation notes. To facilitate conversion to the JSONL format, non-ASCII characters and quotation marks were removed. The note corpus consisted of 258 physician notes, 188 in the training set and 170 in the test set.
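These preprocessing steps can be sketched in Python. This is a minimal sketch, not the study's actual script; the column names (record_id, note_date, note_text) and file names are illustrative assumptions:

```python
import json
import pandas as pd

# Load the REDCap CSV export (file and column names are hypothetical).
notes = pd.read_csv("redcap_export.csv")

# Keep only notes with at least 600 words, then take the first
# qualifying progress note per patient.
notes["word_count"] = notes["note_text"].str.split().str.len()
eligible = notes[notes["word_count"] >= 600]
first_notes = (eligible.sort_values("note_date")
                       .groupby("record_id")
                       .first()
                       .reset_index())

def clean(text: str) -> str:
    """Drop non-ASCII characters and quotation marks before JSONL conversion."""
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return text.replace('"', "").replace("'", "")

# Write one JSON object per line, the format expected by Prodigy.
with open("notes.jsonl", "w") as f:
    for _, row in first_notes.iterrows():
        record = {"text": clean(row["note_text"]),
                  "meta": {"record_id": row["record_id"]}}
        f.write(json.dumps(record) + "\n")
```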
Selection of Phenotype Categories: The 20 phenotype categories (Box 1) were selected based on their frequency in and clinical relevance to multiple sclerosis (MS) [31]. Although many granular terms are available to describe neurology phenotypes in the Human Phenotype Ontology, we chose to ‘roll up’ these terms into 20 high-level categories to increase the interpretability and comprehensibility of the findings.
Phenotyping by Human Annotator: Ground-truth labels for the notes were generated using the Prodigy annotation tool (Explosion AI, Berlin). All 258 physician notes were manually annotated. The task involved identifying text spans corresponding to one of 20 categories of neurological signs and symptoms. The initial phase of annotation used the spans.manual recipe in Prodigy. After the first ten notes were annotated, a preliminary spaCy spancat model was trained, allowing a switch to the spans.correct recipe, which suggests potential spans (Figure 1). The human annotator received the same instructions as the GPT-4 API (Box 1). Our methods for phenotype annotation, including high levels of inter-rater agreement (κ = 0.85), have been described previously [32].
Figure 1:
Annotation screens in Prodigy for text spans indicating weakness. The annotator chooses among 20 labels for selected text spans.
Phenotyping by machine learning approach: NimbleMiner [33] is a tool for the recognition of medical concepts in clinical texts, implemented in R (compatible with version 4.0.2 [34]). It combines machine learning classifiers with word embeddings (word2vec) to identify medical concepts. By transforming seed terms into an internal lexicon called simclins, NimbleMiner uses machine learning classifiers to find matching phrases in clinical narratives. NimbleMiner follows a label classification strategy that uses only positive labels and excludes negated concepts. An initial list of signs and symptoms of multiple sclerosis was augmented with text spans from neurology notes (Figure 3a). NimbleMiner uses prenegations, terms that precede and negate a span (e.g., no sign of weakness), and postnegations, terms that follow and negate a span (e.g., weakness negative), to exclude negated phenotypes. We selected a support vector machine (SVM) classifier within NimbleMiner because of the known proficiency of SVM classifiers at text categorization [35]. The SVM classifier determined the binary presence of each of the 20 neurological phenotypes in the physician notes (Figure 3b).
Figure 3:
Examples of seed terms used to generate simclins (a) and examples of positive and negated text spans for phenotype identified by NimbleMiner (b).
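Although NimbleMiner itself is implemented in R, the prenegation/postnegation logic described above can be illustrated with a short Python sketch. The cue lists and the is_negated function are hypothetical illustrations of the idea, not NimbleMiner's actual code:

```python
# Illustrative negation cues; NimbleMiner's actual lists are more extensive.
PRENEGATIONS = ["no sign of", "denies", "without", "negative for"]
POSTNEGATIONS = ["negative", "not present", "was ruled out"]

def is_negated(sentence: str, span: str) -> bool:
    """Return True if the phenotype span is preceded or followed by a negation cue."""
    s = sentence.lower()
    idx = s.find(span.lower())
    if idx == -1:
        return False
    before, after = s[:idx], s[idx + len(span):]
    if any(before.rstrip().endswith(cue) for cue in PRENEGATIONS):
        return True
    if any(after.lstrip().startswith(cue) for cue in POSTNEGATIONS):
        return True
    return False

print(is_negated("no sign of weakness", "weakness"))             # True (prenegation)
print(is_negated("weakness negative on exam", "weakness"))       # True (postnegation)
print(is_negated("reports weakness in the left leg", "weakness"))  # False
```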
Phenotyping by deep learning approach: We used the spaCy spancat pipeline (Explosion AI, Berlin) to recognize the 20 target phenotype labels. We implemented the default tok2vec component for token-to-vector encoding and the default spancat component for span categorization. Initially, we set the parameters in the config.cfg file to their default values, including the system, components, training, batch, and initialization sections. Our initial training dataset for the spaCy spancat pipeline was 188 physician notes with 11,688 annotated lines. We used the data-to-spacy recipe from Prodigy (Explosion AI) to create training and validation sets. The initial F score was unsatisfactory at 0.34, and a class imbalance with poor recall in minority classes was recognized (Figure 2). Manually created synthetic data with more examples from the underrepresented speech, seizure, and tremor classes were added. Furthermore, because of the problematic dual representation of hyporeflexia, hyperreflexia, and weakness as text and numeric values (see Figure 1), additional examples of numerically encoded phenotypes were added, resulting in a dataset of 15,052 annotated lines. With data augmentation, the F score on the validation dataset increased to 0.62. We then added transfer learning from a previously trained spancat model to achieve an F score of 0.77. Although we attempted to improve spancat performance further by adding external word vectors and a transformer architecture, software version incompatibilities prevented us from implementing those improvements. The resulting model-best was applied to the unseen test dataset to find the neurological phenotypes in each physician note.
Figure 2:
Due to class imbalance in the training dataset for the deep learning spancat model, synthetic data were added, increasing the number of annotated lines from 11,688 to 15,052 and thus enlarging the minority classes ON (optic neuritis), seizure, sleep, and tremor. In addition, training examples were added to the hyperreflexia, hyporeflexia, and weakness classes because of low recall in these classes (see Discussion).
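Once trained, the model-best pipeline can be applied to unseen notes with a few lines of Python. A minimal sketch, assuming the default spans key "sc" (the actual key is set in config.cfg):

```python
import spacy

# Load the best model written out by spaCy training (the model-best directory).
nlp = spacy.load("model-best")

doc = nlp("On exam there is weakness of the left hip flexors and brisk knee reflexes.")

# The span categorizer stores its predictions in doc.spans under the spans key.
phenotypes = {span.label_ for span in doc.spans["sc"]}
print(phenotypes)  # hypothetical output: {'weakness', 'hyperreflexia'}
```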
Phenotyping by LLM approach: The instructions for phenotyping the physician notes were passed to the GPT-4 API [36] as a prompt (Box 1). Initial modifications to the prompt were made interactively with GPT-4 in chat mode. We used the GPT-4 chat mode to resolve ambiguities in the prompt, such as whether to categorize 'facial weakness' as a finding of weakness or a cranial nerve finding. Further prompt modifications were needed to produce GPT-4 output that could be processed by Python. We wrote a Python script that iterated through the physician notes and submitted them to the GPT-4 API (OpenAI) for processing. No speed or complexity limitations were encountered with the high-throughput phenotyping of the 170 physician notes (each approximately 1,000 words). GPT-4 generated a list of phenotypes (parsable by a Python script) (Figure 4a) and a brief explanation of its choices (Figure 4b). The GPT-4 output was saved for further analysis.
Figure 4:
For each physician note, GPT-4 output a list of phenotypes (a) and explanations for its choices (b).
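The note-iteration script described above can be sketched as follows. This is a minimal sketch using the current OpenAI Python client; the model identifier, prompt file name, and temperature setting are assumptions, not the study's recorded configuration:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The phenotyping instructions from Box 1 (file name is hypothetical).
PROMPT = open("box1_instructions.txt").read()

def phenotype_note(note_text: str) -> str:
    """Submit one physician note to the GPT-4 API and return the raw response."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note_text},
        ],
        temperature=0,  # deterministic output simplifies downstream parsing
    )
    return response.choices[0].message.content

# notes is a list of note strings loaded from the prepared corpus:
# results = [phenotype_note(note) for note in notes]
```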
Calculation of Performance Metrics: The ground-truth labels for each physician note were stored in the Prodigy SQLite database. We used Python to convert the ground-truth annotations into a rectangular array in which the first column was the record ID and the next 20 columns held the binarized value for each of the 20 phenotypes (present or absent). We created similar arrays for the binarized predictions of the machine learning, deep learning, and LLM approaches. To evaluate the performance of the computational approaches for high-throughput phenotyping, we selected precision, recall, and accuracy as our metrics, chosen for their direct interpretability and their relevance to binary classification tasks. Python was used to calculate accuracy, precision, and recall by standard methods [37]. A micro average was calculated for each of the twenty phenotype categories, as was an unweighted macro average across all categories.
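A minimal sketch of these metric calculations with scikit-learn, assuming y_true and y_pred are the binarized arrays described above (the function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, labels: list) -> None:
    """y_true, y_pred: arrays of shape (n_notes, 20) with 0/1 phenotype labels."""
    # Per-category metrics: each column is one of the 20 phenotypes.
    for j, label in enumerate(labels):
        p = precision_score(y_true[:, j], y_pred[:, j], zero_division=0)
        r = recall_score(y_true[:, j], y_pred[:, j], zero_division=0)
        a = accuracy_score(y_true[:, j], y_pred[:, j])
        print(f"{label:15s} precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
    # Unweighted (macro) averages across the 20 categories.
    print("macro precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
    print("macro recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
    print("macro accuracy: ", (y_true == y_pred).mean())  # mean of per-column accuracies
```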
Human Studies: The research was approved by the Institutional Review Board of the University of Illinois at Chicago. All physician notes were de-identified and held in a REDCap database [38]. GPT-4 did not retain or reuse patient data.
Results
We compared the performance of the three high-throughput phenotyping approaches. After iterative improvements to the machine learning approach (NimbleMiner) and the deep learning approach (spaCy spancat), all three showed good accuracy (Figure 5). The LLM approach (GPT-4) performed best (accuracy 0.88), followed by the machine learning approach (NimbleMiner, 0.81) and the deep learning approach (spaCy spancat, 0.78). Precision (where higher scores reflect lower false positive rates) was high for all three approaches, although the LLM approach performed best. Recall (where higher scores reflect lower false negative rates) was highest for the LLM approach (0.77), lower for the machine learning approach (0.65), and lowest for the deep learning approach (0.42).
Figure 5:
Heat map showing precision, recall, and accuracy for the three high-throughput phenotyping approaches: machine learning (ML), deep learning (DL), and LLM. Individual phenotype category metrics are micro averages; the overall metrics are macro averages. Abbreviations include CN (cranial nerve and brainstem), EOM (extraocular eye movements), and ON (optic neuritis). The category paresthesias includes sensory loss, numbness, and tingling. The heat map uses 'coolwarm' coloration, so that red indicates higher accuracy, recall, and precision and blue indicates lower scores. The LLM approach outperformed the DL and ML approaches on overall scores.
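A heat map of this kind can be generated with a few lines of Python. A minimal sketch, assuming the per-category and overall metrics have been assembled into a CSV file (the file name and layout are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Rows: phenotype categories plus an overall row; columns: precision, recall,
# and accuracy for the ML, DL, and LLM approaches (hypothetical layout).
scores = pd.read_csv("metrics.csv", index_col=0)

fig, ax = plt.subplots(figsize=(8, 10))
sns.heatmap(scores, cmap="coolwarm", annot=True, fmt=".2f", vmin=0, vmax=1, ax=ax)
ax.set_title("Precision, recall, and accuracy by approach")
plt.tight_layout()
plt.savefig("figure5_heatmap.png", dpi=300)
```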
Both the training dataset (Figure 2) and the test dataset (not shown) demonstrated class imbalances. In particular, the tremor, speech, and seizure classes were underrepresented. Three phenotype classes (hyporeflexia, hyperreflexia, weakness) were dual encoded by physicians, whose notes used descriptive text or numeric scores, such as weakness documented as text ('leg with weakness') or as a numerical score ('Hip Flexors 3 4') (see Figure 1).
The LLM approach outperformed machine learning and deep learning approaches for recall in most phenotype categories, with pain and seizure being the exceptions. Although the superiority of the LLM approach in precision for individual phenotypes was less pronounced, it surpassed the machine learning and deep learning approaches in several categories, including speech, sleep, ON, incoordination, fatigue, and CN. The LLM approach consistently outperformed the other approaches in macro-level overall performance metrics, including accuracy, precision, and recall (Figure 5).
Discussion
High-throughput phenotyping of patient data, crucial to advancing precision medicine, involves converting signs and symptoms from clinical notes into computable codes. Given the volume of electronic health records and the linguistic challenges involved, including synonymy, polysemy, irregular abbreviations, colloquialisms, misspellings, and nonstandard terminologies, automated methods are essential but face significant obstacles. To be useful, the phenotyping of text stored in EHRs must be fast, accurate, and detailed [3].
We performed high-throughput phenotyping on 170 physician notes using three computational approaches: machine learning, deep learning, and LLM. All notes were written by neurologists and carried a diagnosis of multiple sclerosis. Phenotyping involved finding the number of occurrences of 20 categories of neurological symptoms. Since writing styles and habits differ between physicians (some repeat signs and symptoms multiple times in their notes, others do not), the results were binarized so that the occurrence of a neurological sign or symptom (phenotype) was recorded as ‘present’ or ‘absent’ in each note. A further limitation of this study is that we performed phenotyping of neurological signs at the level of 20 high-level categories and not at the individual phenotype term level.
The machine learning, deep learning, and LLM approaches performed at high levels of accuracy (0.81, 0.77, and 0.88, respectively; see Figure 5). These accuracies are impressive, given that agreement between human annotators reaches a ceiling at κ ≈ 0.85 [32]. The superior performance of the LLM approach is notable given the complexity of this multiclass classification task, with its high number of classes and class imbalances [39, 40]. In particular, the deep learning approach (spaCy spancat) faced challenges due to low counts in minority classes (seizures, sleep, and EOM), as shown in Figure 5. This difficulty was partially addressed by class rebalancing using synthetic data. Dual encoding of certain phenotypes as both a numerical score and a textual description (see Figure 1) proved challenging for all approaches, but less so for the LLM approach.
The superior performance of the LLM approach on minority phenotype classes (speech, tremor, seizure) and dually encoded phenotype classes (hyporeflexia, hyperreflexia, weakness) is notable. GPT-4's ability to handle class imbalances and to decode mixed-format data is likely related to its extensive pre-training, which also gave it advantages over the other approaches in analyzing misspelled, irregular, or ambiguous text. These results highlight the potential role of the LLM approach in the high-throughput phenotyping of physician notes. Furthermore, the LLM (GPT-4) method offered explanations for its selections without being prompted to do so (Figure 4), suggesting that advances in explainable AI (XAI) [41] have been incorporated into the model architecture.
Several differences in the ease of implementation among the three approaches should be mentioned. The implementation of the LLM approach (GPT-4) was straightforward. We used the GPT-4 chat mode to refine the prompt for high-throughput phenotyping (Box 1). When we implemented the GPT-4 API, additional changes to the prompt were needed to resolve ambiguities and obtain output suitable for parsing by Python (Figure 4a). The configuration of the machine learning approach (NimbleMiner) required meticulous selection of seed terms for each of the 20 phenotype categories and rigorous curation of the generated simclins. We went through several iterations of seed generation and simclin curation until acceptable levels of accuracy were obtained. Implementing the deep learning approach (spaCy spancat) was the most time-consuming. We created a training dataset by annotating physician notes for the initial spancat pipeline. Due to poor model accuracy in minority classes (especially low recall) and class imbalances, additional model training was performed with synthetic data. Further improvements in the performance of the spaCy spancat pipeline required transfer learning from a previously trained model.
The implementation of high-throughput neurological phenotyping was easiest with the LLM approach. Furthermore, the LLM approach outperformed the machine learning and deep learning approaches in accuracy and recall (Figure 5). Although our results with the LLM approach (GPT-4) are encouraging, confirmation with a larger and more diverse corpus of physician notes is needed. Several limitations of this study should be mentioned. High-throughput phenotyping was performed on a limited number of neurological notes, all carrying a diagnosis of multiple sclerosis; more notes with different diagnoses should be studied. The phenotyping was done at a coarse level of detail, using 20 broad categories. However, phenotyping can be performed at a more granular level; for example, the Human Phenotype Ontology has approximately 17,000 terms to document human phenotypes [42]. The ability of the LLM method to phenotype at higher levels of granularity should be studied. Additional fine-tuning of the machine learning approach (NimbleMiner) would likely have improved accuracy; additional seed terms and simclin curation could have improved performance in some low-performing categories such as 'cognitive', 'sphincter', and 'EOM'. Modifications to the deep learning approach (spaCy spancat) would probably have improved its performance as well, including adding a transformer architecture to the pipeline, adding specialized pre-trained word vectors, providing additional training examples, and better balancing the phenotype classes in the training dataset.
This study demonstrates the power, simplicity, and generalizability of large language model approaches when applied to the high-throughput phenotyping of text within EHRs. Large language models (LLMs) are poised to become the dominant approach to high-throughput phenotyping. GPT-4 outperformed the more traditional approaches and proved easier to implement than the deep learning or machine learning approaches. Broader integration of large language models into electronic health records for phenotyping will depend on additional research that validates these findings with different note types, different EHR data types, and different medical fields. If large language models are used in patient care, a determination of their regulatory status will be needed, as will an evaluation of safety, privacy, and security concerns. The level of accuracy that large language models must achieve for clinical high-throughput phenotyping also remains to be established. Large language models represent a significant advance in the high-throughput phenotyping of physician notes, and greater accuracy can be expected with additional training and fine-tuning of the underlying models.
References
- [1]. Sahu M, Gupta R, Ambasta RK, Kumar P. Artificial intelligence and machine learning in precision medicine: a paradigm shift in big data analysis. Progress in Molecular Biology and Translational Science. 2022;190(1):57–100. doi: 10.1016/bs.pmbts.2022.03.002.
- [2]. Afzal M, Islam SR, Hussain M, Lee S. Precision medicine informatics: principles, prospects, and challenges. IEEE Access. 2020;8:13593–612.
- [3]. Robinson PN. Deep phenotyping for precision medicine. Human Mutation. 2012;33(5):777–80. doi: 10.1002/humu.22080.
- [4]. Hier D, Yelugam R, Azizi S, Wunsch D. A focused review of deep phenotyping with examples from neurology. Eur Sci J. 2022;18:4–19.
- [5]. Hier D, Yelugam R, Azizi S, Carrithers M, Wunsch II DC. High throughput neurological phenotyping with MetaMap. Eur Sci J. 2022;18:37–49.
- [6]. Mir RR, Reynolds M, Pinto F, Khan MA, Bhat MA. High-throughput phenotyping for crop improvement in the genomics era. Plant Science. 2019;282:60–72. doi: 10.1016/j.plantsci.2019.01.007.
- [7]. Gehan MA, Kellogg EA. High-throughput phenotyping. American Journal of Botany. 2017;104(4):505–8. doi: 10.3732/ajb.1700044.
- [8]. Alzoubi H, Alzubi R, Ramzan N, West D, Al-Hadhrami T, Alazab M. A review of automatic phenotyping approaches using electronic health records. Electronics. 2019;8(11):1235.
- [9]. Pathak J, Kho AN, Denny JC. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. Journal of the American Medical Informatics Association. 2013;20(e2):e206–11. doi: 10.1136/amiajnl-2013-002428.
- [10]. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, et al. A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association. 2014;21(2):221–30. doi: 10.1136/amiajnl-2013-001935.
- [11]. Krauthammer M, Nenadic G. Term identification in the biomedical literature. Journal of Biomedical Informatics. 2004;37(6):512–26. doi: 10.1016/j.jbi.2004.08.004.
- [12]. Eltyeb S, Salim N. Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics. 2014;6(1):1–12. doi: 10.1186/1758-2946-6-17.
- [13]. Quimbaya AP, Münera AS, Rivera RAG, Rodríguez JCD, Velandia OMM, Peña AAG, et al. Named entity recognition over electronic health records through a combined dictionary-based approach. Procedia Computer Science. 2016;100:55–61.
- [14]. Hirschman L, Morgan AA, Yeh AS. Rutabaga by any other name: extracting biological names. Journal of Biomedical Informatics. 2002;35(4):247–59. doi: 10.1016/s1532-0464(03)00014-5.
- [15]. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association. 2011;18(5):552–6. doi: 10.1136/amiajnl-2011-000203.
- [16]. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360. 2016.
- [17]. Chiu JP, Nichols E. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics. 2016;4:357–70.
- [18]. Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics. 2017;33(14):i37–48. doi: 10.1093/bioinformatics/btx228.
- [19]. Gehrmann S, Dernoncourt F, Li Y, Carlson ET, Wu JT, Welt J, et al. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS One. 2018;13(2):e0192360. doi: 10.1371/journal.pone.0192360.
- [20]. Arbabi A, Adams DR, Fidler S, Brudno M. Identifying clinical terms in medical text using ontology-guided machine learning. JMIR Medical Informatics. 2019;7(2):e12596. doi: 10.2196/12596.
- [21]. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
- [22]. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems. 2017. p. 5998–6008.
- [23]. Zhu R, Tu X, Huang JX. Utilizing BERT for biomedical and clinical text mining. In: Data Analytics in Biomedical Engineering and Healthcare. Elsevier; 2021. p. 73–103.
- [24]. Yu X, Hu W, Lu S, Sun X, Yuan Z. BioBERT based named entity recognition in electronic medical record. In: 2019 10th International Conference on Information Technology in Medicine and Education (ITME). IEEE; 2019. p. 49–52.
- [25]. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. doi: 10.1093/bioinformatics/btz682.
- [26]. Ji Z, Wei Q, Xu H. BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings. 2020;2020:269.
- [27]. Yan C, Ong H, Grabowska M, Krantz M, Su WC, Dickson A, et al. Large language models facilitate the generation of electronic health record phenotyping algorithms. medRxiv. 2023. doi: 10.1093/jamia/ocae072.
- [28]. Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, et al. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns. 2023.
- [29]. Wang A, Liu C, Yang J, Weng C. Fine-tuning large language models for rare disease concept normalization. bioRxiv. 2023.
- [30]. Munzir SI, Hier DB, Carrithers MD. High throughput phenotyping of physician notes with large language and hybrid NLP models. arXiv. 2024. Accessed March 12, 2024. Available from: https://arxiv.org/abs/2403.05920.
- [31]. Howlett-Prieto Q, Oommen C, Carrithers MD, Wunsch DC, Hier DB. Subtypes of relapsing-remitting multiple sclerosis identified by network analysis. Frontiers in Digital Health. 2023;4:1063264. doi: 10.3389/fdgth.2022.1063264.
- [32]. Oommen C, Howlett-Prieto Q, Carrithers MD, Hier DB. Inter-rater agreement for the annotation of neurologic signs and symptoms in electronic health records. Frontiers in Digital Health. 2023;5:1075771. doi: 10.3389/fdgth.2023.1075771.
- [33]. Topaz M, Murga L, Bar-Bachar O, McDonald M, Bowles K. NimbleMiner: an open-source nursing-sensitive natural language processing system based on word embedding. CIN: Computers, Informatics, Nursing. 2019;37(11):583–90. doi: 10.1097/CIN.0000000000000557.
- [34]. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2020. Available from: https://www.R-project.org/
- [35]. Joachims T. Text categorization with support vector machines: learning with many relevant features. In: European Conference on Machine Learning. Springer; 1998. p. 137–42.
- [36]. OpenAI. ChatGPT [large language model]. 2024. Available from: https://chat.openai.com.
- [37]. Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061. 2020.
- [38]. Patridge EF, Bardyn TP. Research Electronic Data Capture (REDCap). Journal of the Medical Library Association: JMLA. 2018;106(1):142.
- [39]. Grandini M, Bagli E, Visani G. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756. 2020.
- [40]. Aly M. Survey on multiclass classification methods. Neural Netw. 2005;19(1-9):2.
- [41]. Minh D, Wang HX, Li YF, Nguyen TN. Explainable artificial intelligence: a comprehensive review. Artificial Intelligence Review. 2022. p. 1–66.
- [42]. Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, et al. The Human Phenotype Ontology in 2021. Nucleic Acids Research. 2021;49(D1):D1207–17. doi: 10.1093/nar/gkaa1043.