Author manuscript; available in PMC: 2024 Dec 1.

Published in final edited form as: Artif Intell Med. 2023 Nov 1;146:102701. doi: 10.1016/j.artmed.2023.102701

Table 2:

Examples of frequently used linguistic features

Levels of linguistic features	Description	Methods	Some example methods explained in selected studies
Lexical features	Numerical characteristics of tokens in text documents, such as token count and length.	N-gram (e.g., uni- and bi-grams), capitalization (uppercase, title case), stemming, lemmatization, stopwords removal, lexicon, word embeddings	“Stop-words removal (e.g., “is”, “an”, “the”, etc.), stemming, and number to string conversion.” [Banerjee, 2019] “Lexical variances in the extraction rules [i.e., misspellings (e.g., obese* instead of obsessive)].” [Chandran, 2019] “N-grams represent concepts of serious mental illness symptomatology.” [Jackson, 2018] “Text processing included lower casing; removal of punctuation, stop words, and numbers; word stemming; and tokenization.” [Obeid, 2020]
Syntactic features	Patterns of sentence structures defined by language grammar.	Part-of-speech (POS) tags, constituency grammar, dependency grammar	“POS tagger and multi-word term identification to identify symptoms and non-symptoms were used.” [Divita, 2017] “POS tags in conjunction with knowledge engineering features generated to build a sentence classifier.” [Jackson, 2017] “Syntactic phrases representative of patients’ functional status including noun phrases (e.g. “patient”), prepositional phrases (e.g. “with pain”), and adjective/adverb phrases (e.g. “very tired”) using two reference standards.” [Pakhomov, 2011] “Syntactic patterns of concept phrases were mined from continuous, non-permuted forms of synonyms, and these patterns were used to detect discontinuous and/or permuted concept phrases.” [Torii, 2018]
Semantic features	Linguistic units of meaning-holding components that represent word meaning, such as lexicon definitions, dependency between tokens, and semantic networks.	Semantic definitions from lexicons (LOINC, SNOMED CT, UMLS, etc.), relative temporal words (next, later, until, etc.), absolute temporal expressions (a.m., p.m., etc.), meaning of the numbers (doses, levels), deidentification, topics of the section	“Semantic variances in terms of obsessive and compulsive in the extraction (alternative meanings beyond their definition in the context of Obsessive Compulsive Symptoms (OCS)).” [Chandran, 2019] “Semantic keywords identifying the Altered mental status cluster of symptoms in the context of pulmonary embolism.” [Obeid, 2019] “UMLS semantic networks which are relevant to clinical findings were used.” [Torii, 2018] “The symptom dictionary was based on UMLS, which includes a semantic network.” [Le, 2018]
Contextual features	Neighboring linguistic components (e.g., word, phrase, or sentence) of tokens or sentences that represent similar semantic meanings.	Negation/affirmation, complex temporal relations, discourse structure, line position, order of sections, implicit context-dependent information, feature representations from pre-trained neural embeddings	“Distinguishing between instances where a patient is described as experiencing a particular symptom from instances where the texts state that the patient is not experiencing that symptom, or where it is someone else (e.g. a friend or relative) who is experiencing that specific symptom.” [Chandran, 2019] “The “conditional” context label is considered when the term is mentioned in the following context (e.g., “I recommended nitroglycerin if he should develop chest pain”).” [Pakhomov, 2008] “Depending on the context, weight gain could indicate either fluid accumulation because of worsening heart failure or an improvement in appetite because of decreased gut edema associated with a higher dose of diuretics.” [Leiter, 2020] “Subject terms (e.g., ‘mother’, ‘patient’), negation terms (e.g., ‘does not’), hypothetical terms (e.g., ‘if’), temporal terms (e.g., ‘previously’) and termination terms (e.g., ‘however’).” [Iqbal, 2017]
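
To make two rows of Table 2 concrete, the following is a minimal Python sketch of lexical features (lower casing, punctuation and stop-word removal, n-grams) and of trigger-term context labels of the kind quoted from [Iqbal, 2017]. It is an illustration only and does not reproduce the method of any cited study. The stop-word list, the trigger lists, and the function names `lexical_features` and `context_labels` are hypothetical; stemming and termination terms are omitted for brevity.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (assumption, not from the source).
STOP_WORDS = {"is", "an", "the", "a", "of", "and"}

# Hypothetical trigger terms grouped by context label (assumption), loosely
# following the categories quoted in Table 2: subject, negation, hypothetical,
# and temporal terms.
TRIGGERS = {
    "negated": ["does not", "denies", "no"],
    "hypothetical": ["if", "should"],
    "historical": ["previously"],
    "other_subject": ["mother", "father", "friend"],
}


def lexical_features(text, n=2):
    """Return (tokens, n-grams) after simple lexical normalization."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization; drop stop words and purely numeric tokens.
    tokens = [t for t in no_punct.split()
              if t not in STOP_WORDS and not t.isdigit()]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams


def context_labels(sentence, term):
    """Return sorted context labels whose triggers precede `term`.

    Returns None if `term` does not occur in the sentence. Only text before
    the first occurrence of `term` is inspected, which is a simplification.
    """
    lowered = sentence.lower()
    idx = lowered.find(term.lower())
    if idx == -1:
        return None
    prefix = lowered[:idx]
    labels = set()
    for label, phrases in TRIGGERS.items():
        for phrase in phrases:
            # Word boundaries prevent "no" from matching inside "not".
            if re.search(r"\b" + re.escape(phrase) + r"\b", prefix):
                labels.add(label)
    return sorted(labels)


if __name__ == "__main__":
    print(lexical_features("The patient is very tired."))
    print(context_labels(
        "I recommended nitroglycerin if he should develop chest pain",
        "chest pain"))
```

In this sketch, the "conditional" example quoted from [Pakhomov, 2008] receives the `hypothetical` label because "if" and "should" precede the symptom mention. A sentence with no trigger before the term yields an empty list, which would be read as an affirmed mention attributed to the patient.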