Abstract
Purpose:
A preliminary version of a paraphasia classification algorithm (henceforth called ParAlg) has previously been shown to be a viable method for coding picture naming errors. The purpose of this study is to present an updated version of ParAlg, which uses multinomial classification, and comprehensively evaluate its performance when using two different forms of transcribed input.
Method:
A subset of 11,999 archival responses produced on the Philadelphia Naming Test were classified into six cardinal paraphasia types using ParAlg under two transcription configurations: (a) using phonemic transcriptions for responses exclusively (phonemic-only) and (b) using phonemic transcriptions for nonlexical responses and orthographic transcriptions for lexical responses (orthographic-lexical). Agreement was quantified by comparing ParAlg-generated paraphasia codes between configurations and relative to human-annotated codes using four metrics (positive predictive value, sensitivity, specificity, and F1 score). An item-level qualitative analysis of misclassifications under the best performing configuration was also completed to identify the source and nature of coding discrepancies.
Results:
Agreement between ParAlg-generated and human-annotated codes was high, although the orthographic-lexical configuration outperformed the phonemic-only configuration (weighted-average F1 scores of .87 and .78, respectively). A qualitative analysis of the orthographic-lexical configuration revealed a mix of human- and ParAlg-related misclassifications, the former of which were related primarily to phonological similarity judgments, whereas the latter were due primarily to semantic similarity assignment.
Conclusions:
ParAlg is an accurate and efficient alternative to manual scoring of paraphasias, particularly when lexical responses are orthographically transcribed. With further development, it has the potential to be a useful software application for anomia assessment.
Anomia is the hallmark deficit of individuals with poststroke aphasia (Goodglass & Wingfield, 1997). In widely accepted models of spoken word production, anomia is understood to reflect disruptions in lexical–semantic and/or phonological processing (e.g., Dell, 1986; Levelt et al., 1999). For example, Dell's (1986) two-step spreading activation model posits that the first stage of lexical access is dominated by activation of semantic associations that map to a specific word form, or lemma, whereas the second stage involves assignment of the appropriate phonemic representations to construct the phonological code that is then processed for motor planning and execution (e.g., Levelt et al., 1999; Walker & Hickok, 2016a).
Patterns of anomic errors, also referred to as paraphasias, are thus understood to represent impaired processes critical to spoken word production. For instance, productions that arise from impairments in lexical–semantic processing commonly result in errors that differ from the intended target along a semantic dimension, such as saying “dog” for cat. In contrast, deficits in phonological encoding may yield errors that resemble the phonemes of the intended target, resulting in real-word productions, such as “cat” for hat, or nonwords (i.e., neologisms) that may or may not be phonologically related to the target, such as “tat” for hat and “blil” for hat, respectively. The focus of this study is on the development and validation of machinery that automatically classifies paraphasias in the context of a commonly used picture naming test (Philadelphia Naming Test [PNT]; Roach et al., 1996).
Accurate characterization of paraphasias has important clinical implications. First, paraphasia types may be useful in prognosticating about recovery outcomes and providing informed education to individuals with aphasia and their caregivers. For example, certain paraphasia types, such as neologisms, have been shown to be more sensitive to overall anomia severity, whereas others, such as semantically based real-word errors, remain stable across the naming ability continuum (e.g., Dell et al., 1997). Furthermore, over the course of recovery, paraphasias most sensitive to naming severity have been shown to improve at a greater magnitude (e.g., Schwartz & Brecher, 2000) and are associated with overall improvement in language skills (e.g., Meier et al., 2020). Second, paraphasia classification may be useful in selecting an appropriate treatment program, as many have been developed with the specific goal of remediating deficits in phonological encoding, by targeting either spoken language (e.g., Kendall et al., 2008, 2015; Leonard et al., 2008) or written language (e.g., Beeson et al., 2010, 2018; DeMarco et al., 2018), whereas others have been developed specifically with the goal of strengthening lexical–semantic representations (e.g., Boyle & Coelho, 1995; Coelho et al., 2000).
Accurate analysis of paraphasias has important implications for theoretical investigations as well. Specifically, models of spoken language production are dependent on analyzing distributions of semantically and phonologically mediated errors to help answer questions regarding the cognitive architecture that governs neurotypical and impaired language populations. For example, initial computational efforts by Dell et al. (1997), based on the two-step spreading activation model, examined distributions across correct responses and five error categories to derive estimates of lexical–semantic and phonological processing, expressed as s and p weights, in individuals with and without aphasia. More recently, in an effort to better account for the phonological errors observed in individuals with conduction aphasia, Walker and Hickok (2016a) expanded upon Dell's work to develop the semantic–lexical–auditory–motor model, and error distributions generated from 255 patients with aphasia demonstrated improved fit for the group in question, suggesting that their model, which additionally accounts for auditory–motoric feedback, may provide a more explanatory account of spoken language production. While questions regarding the significance of these findings in the context of alternative models remain (see Goldrick, 2016; Walker & Hickok, 2016b), these exchanges illustrate the value of accurate paraphasia analysis in exploring theoretical assumptions regarding spoken language production. Consequently, accurately, reliably, and efficiently characterizing the relationship between paraphasic productions and their respective targets is of high clinical and theoretical importance.
The PNT, a 175-item picture naming test, is among the most widely studied tools for the assessment of anomic deficits in poststroke aphasia. Its ubiquitous use in a variety of research investigations can be attributed to its strong theoretical basis (e.g., Dell et al., 1997; Schwartz et al., 2006) and robust psychometric properties, including excellent interrater and test–retest reliability (e.g., Roach et al., 1996; Walker & Schwartz, 2012). Furthermore, advanced psychometric methods, such as item response theory, have been successfully utilized to derive estimates of item difficulty (Fergadiotis, Hula, et al., 2015), develop a computer-adaptive testing paradigm for the assessment of overall anomia severity (Fergadiotis, Hula, et al., 2019; Hula et al., 2015, 2020), and predict item difficulty estimates (Fergadiotis, Swiderski, & Hula, 2019). Efforts such as these demonstrate the strength of the PNT as a tool for diagnostic inquiry.
There are, however, disadvantages to the PNT that limit its applicability to clinical and research settings. First, although the PNT has been subjected to extensive psychometric evaluation, this work has primarily focused on dichotomous accuracy scores, and little is known about the reliability of its polytomous paraphasia classification scheme, which, as noted above, is of particular value to both clinical and theoretical research. Current evidence suggests that although inter- and intrarater agreement among scorers may be high overall, there is also considerable variability when assigning polytomous codes (e.g., Walker et al., 2022). This variability likely stems from the complexity of the classification scheme (see https://mrri.org/philadelphia-naming-test/). For example, it requires judgments of semantic relatedness along multiple dimensions (e.g., taxonomic, associative, morphological), a task previously shown to demonstrate suboptimal interrater agreement (e.g., Hill et al., 2015; Huang et al., 2012). Moreover, judgments of phonological similarity, although deterministic, require the application of numerous rules regarding shared phonemic and phonetic features. Second, phonological similarity judgments depend on phonemic transcription of all responses, a time-consuming and training-intensive procedure that introduces an additional potential source of error into the overall scoring procedure. As such, both the use of phonemic transcription and the intricacy of the scoring procedures make PNT paraphasia classification a costly and potentially error-prone endeavor.
Algorithmic classification of behavioral test scores, and specifically paraphasias, has the potential to be a reliable and efficient alternative to manual scoring methods. The use of computer algorithms for evaluation of language and cognitive processing has an expanding literature base, with recent efforts demonstrating their ability to detect grammatical errors (Leacock et al., 2010), analyze articulatory patterns (Dudy et al., 2015; Moustroufas & Digalakis, 2007), and automatically score neuropsychological assessments (Prud'hommeaux & Roark, 2011; Salem et al., 2021). Drawing upon these efforts, we developed a preliminary version of a paraphasia algorithm, henceforth called ParAlg, that was designed to approximate or directly imitate the scoring classification scheme of the PNT for six cardinal types of paraphasias (Fergadiotis et al., 2016). In brief, this preliminary version of ParAlg consisted of three separate binary classifiers that assigned the presence or absence of a feature along three linguistic dimensions: a lexicality classifier, which distinguished real words from neologisms; a phonological similarity classifier, which identified real-word phonological errors; and a semantic similarity classifier, which identified real-word semantic errors. In other words, a given paraphasia was analyzed multiple times and received separate classifications. For example, a semantically related real-word paraphasia would receive three separate classifications: (a) real word, as derived from the lexicality classifier; (b) phonologically dissimilar, as derived from the phonological similarity classifier; and (c) semantically similar, as derived from the semantic similarity classifier. Results of our initial testing showed high agreement between each ParAlg classifier and human-annotated judgments. Sensitivity ranged from .77 (semantic similarity) to .98 (lexicality) and specificity ranged from .91 (phonological similarity) to .92 (lexicality, semantic similarity) across the three classifiers (Fergadiotis et al., 2016). These findings suggest that the use of algorithmic classification has the potential to be a cost-effective, efficient, and reliable approach to paraphasia analysis.
Our algorithmic approach was not without limitations, however. First, the preliminary version of ParAlg treated paraphasia classification as three separate binary classifiers, ignoring the necessary integration of classifiers to arrive at a single paraphasia code (see Table 1 for illustration). Although agreement at the binary classification level was high, no individual classifier demonstrated perfect agreement, and error from each will ultimately create a multiplicative effect, as classifiers are necessarily combined to assign a paraphasia code. Thus, quantifying agreement with multinomial classification (i.e., treating all three classifiers instead as feature assignments to arrive at a single paraphasia classification) is necessary for developing ParAlg as a diagnostic tool. In other words, a given paraphasia would instead be analyzed once and receive a single classification. Contrasting with the example from the preceding paragraph, a semantically related real word would receive three binary feature assignments (i.e., + lexicality, − phonological similarity, and + semantic similarity) that are then used simultaneously to produce a single classification: semantic paraphasia. Furthermore, ParAlg is ultimately intended to be used as part of a larger diagnostic system, where other elements may introduce additional noise (e.g., automatic speech recognition, computer-adaptive testing algorithms). As such, understanding the reliability of each element in the system will be critical for such efforts.
Table 1.
Linguistic feature assignment across the six paraphasia codes of interest.
| Paraphasia code | Lexicality | Phonological similarity | Semantic similarity |
|---|---|---|---|
| Abstruse Neologism | − | − | − |
| Formal | + | + | − |
| Mixed | + | + | + |
| Neologism | − | + | − |
| Semantic | + | − | + |
| Unrelated | + | − | − |
Note. The “+” sign indicates the presence of a given feature, and the “–” sign indicates the absence of a given feature.
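The feature-to-code mapping in Table 1 can also be rendered as a simple lookup keyed by the three binary assignments. The sketch below is illustrative only; the function and dictionary names are ours and are not part of ParAlg.

```python
# Illustrative sketch: map the three binary linguistic features from Table 1
# to a single paraphasia code. Keys are (lexicality, phonological similarity,
# semantic similarity); names and encoding are hypothetical, not ParAlg's API.
PARAPHASIA_CODES = {
    (False, False, False): "Abstruse Neologism",
    (True,  True,  False): "Formal",
    (True,  True,  True):  "Mixed",
    (False, True,  False): "Neologism",
    (True,  False, True):  "Semantic",
    (True,  False, False): "Unrelated",
}

def code_from_features(lexical: bool, phon_similar: bool, sem_similar: bool) -> str:
    """Return the paraphasia code for a feature triple, per Table 1."""
    # Semantic similarity has no bearing on nonlexical responses (see the text),
    # so it is ignored when lexicality is absent.
    if not lexical:
        sem_similar = False
    return PARAPHASIA_CODES[(lexical, phon_similar, sem_similar)]

# Example: a semantically related real word (+lex, -phon, +sem) -> "Semantic"
print(code_from_features(True, False, True))
```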
Second, our preliminary development work relied exclusively on human-annotated phonemic transcription as input to classify all responses. Phonemic transcription itself is time intensive and, in the case of real-word responses, potentially more error prone than orthographic transcription, as there often are numerous potential pronunciations for a single orthographic word form. As such, the use of human-annotated orthographic transcription of real-word responses as input to the ParAlg system could be a simple way to enhance the implementation of ParAlg in clinical settings, where extensive transcription is not feasible. These two different forms of input would represent distinct ways in which ParAlg could be used—one where the system is fully automated (phonemic transcriptions for all responses) and another where the system receives a “helping hand” from the clinician (orthographic transcriptions for real-word responses).
Finally, previous work used an approach to error analysis that looked at only a small sample of responses (Fergadiotis et al., 2016). A comprehensive error analysis of the complete data set, where discrepancies between human-annotated and algorithm-generated paraphasia codes are reviewed qualitatively on an item-by-item level, is critical to software development. This debugging approach allows for identification and subsequent remediation of errors within the coding pipeline that are not readily apparent when examining overall performance.
The purpose of this study was to present an updated version of ParAlg that addresses the aforementioned limitations and comprehensively evaluate its performance relative to human-annotated codes using a large data set of responses on the PNT. Our research questions were as follows: (a) What is the agreement between human annotators and our algorithmic approach at the level of paraphasia code assignment when providing two forms of transcribed input? (b) Is paraphasia code agreement between human annotators and ParAlg systematically higher for one form of transcribed input versus another? A secondary aim was to provide a complete item-level qualitative analysis of paraphasia code discrepancies from the best performing transcription input to (a) identify areas for future coding improvements and (b) characterize patterns and the extent of paraphasia misclassification by both human annotators and ParAlg.
Method
Participants and Data Set Preparation
First, administrations of the PNT for all individuals with a diagnosis of aphasia (n = 296) were downloaded from the Moss Aphasia Psycholinguistics Project Database (MAPPD; Mirman et al., 2010). Details regarding demographic and clinical characteristics can be found in Table 2. Responses consisting of single-word paraphasias were then extracted, yielding an 11,999-item data set, henceforth referred to as MAPPD-12K. MAPPD-12K was then reviewed by the first author (M.C.), and all duplicate target–response pairs (i.e., instances where the transcriptions and human-annotated paraphasia code were in exact agreement) were hand-coded and removed from the item-level qualitative analysis in an effort to reduce coding burden. For a detailed summary of the data set preparation process, see Supplemental Material S1.
Table 2.
Demographic and clinical characteristics of the participant sample.
| Characteristic | Value |
|---|---|
| Age (years) | |
| M (SD), range | 58.8 (13), 22–86 |
| Missing (%) | 6.4 |
| Education (years) | |
| M (SD), range | 13.8 (2.9), 6–22 |
| Missing (%) | 6.4 |
| Ethnicity (%) | |
| African American | 39.2 |
| African American/Hispanic | 0.3 |
| Asian | 0.3 |
| White | 53 |
| White/Hispanic | 0.7 |
| Missing | 6.4 |
| Gender (%) | |
| Female | 43.9 |
| Male | 56.1 |
| Months postonset | |
| M (SD), range | 29.9 (46.8), 1–381 |
| Missing (%) | 6.4 |
| Philadelphia Naming Test (% correct) | |
| M (SD), range | 60.7 (2.8), 1.1–98.3 |
| Western Aphasia Battery–Aphasia Quotient (Kertesz, 2006) | |
| M (SD), range | 73.3 (17.8), 25.2–99.3 |
| Missing (%) | 32.8 |
Transcription Configurations
Each item, or target–response pair, in MAPPD-12K includes an orthographic transcription of the target word; a phonemic transcription of the response; and, for responses judged by human annotators to be real words, or lexical, an orthographic transcription of the response. Importantly, both types of transcription are necessary for completion of the judgments inherent to the PNT scoring protocol (Roach et al., 1996). For response orthographic transcription, judgments of a response's lexicality are made intrinsically by the annotator, who determines whether a string of phonemes maps to a real word and, if so, assigns the appropriate orthographic form. In cases where a response is assigned an orthographic transcription, this transcription is compared to the target orthographic transcription to make judgments regarding the semantic similarity of the pair. For response phonemic transcription, the target–response pairs are compared to make judgments of phonological similarity.
MAPPD-12K is impoverished with regard to transcription in two respects: (a) Target phonemic transcriptions are not provided, although separate files on the developer's website list transcription conventions and target phonemic transcriptions (available at https://mrri.org/philadelphia-naming-test/), and (b) response phonemic transcriptions are missing stress and syllabification markings. Moreover, orthographic and phonemic transcriptions of responses often lack a one-to-one correspondence, most commonly where the orthographic transcription indicates a lexical response while the phonemic transcription suggests the opposite. For example, a response with the orthographic transcription jewelry has the corresponding phonemic transcription /dʒɛləri/ rather than the canonical pronunciation /dʒuləri/. In cases such as these, two possibilities arise: (a) The phonemic transcription is correct and the response is nonlexical in nature, or (b) the orthographic transcription is correct and the response is lexical (i.e., a real word). Rather than infer which transcription represented the true response, our initial approach in the preliminary development of ParAlg (Fergadiotis et al., 2016) was to remove orthographic transcriptions and subsequently optimize them using an automated approach as a preparatory step for paraphasia classification. This involved reconstructing the response orthographic transcription, as well as generating a target phonemic transcription and algorithmically assigning phonetic features (i.e., stress and syllable position) to both the target and response phonemic transcription (see the Appendix for details). This configuration, in which phonemic transcriptions constitute the only human-annotated input for responses, will henceforth be referred to as the phonemic-only configuration and represents the base configuration for ParAlg in its current form (see the left side of Figure 1 for a visual walk-through).
Figure 1.

ParAlg transcription configurations. The left bifurcation of the figure shows an overview of the removal and subsequent reconstruction of response orthographic transcriptions for the phonemic-only transcription configuration of this study. The right bifurcation provides an analogous overview for the orthographic-lexical configuration. Lexical responses are defined as the assignment of the following human-annotated codes: Formal, Mixed, Semantic, or Unrelated. A detailed description of transcription reconstruction for both configurations is provided in the Appendix. MAPPD = Moss Aphasia Psycholinguistics Project Database.
To address our research aim of comparing ParAlg's agreement with human annotators when provided two different forms of transcription input, we created an alternative transcription configuration of MAPPD-12K henceforth called orthographic-lexical. Here, we relied on the human-annotated paraphasia code to determine which transcription should be retained. Specifically, response orthographic transcriptions were retained for those assigned a lexical code (i.e., Formal, Mixed, Semantic, or Unrelated) by a human annotator, and their corresponding response phonemic transcriptions were removed. ParAlg optimization of MAPPD-12K remained the same, except that a response phonemic transcription was now reconstructed and enhanced to include phonetic information (i.e., stress and syllabification) identical to that of the phonemic-only configuration, as opposed to a response orthographic transcription (see the right side of Figure 1 for a visual walk-through and the Appendix for details).
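To make the contrast between the two configurations concrete, the sketch below shows how a single target–response pair might be represented as input under each. The record structure and field names are hypothetical illustrations, not MAPPD's or ParAlg's actual data format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for one PNT target-response pair; field names are ours.
# Phonemic-only configuration: every response carries a human-annotated phonemic
# transcription, and the orthographic form is reconstructed automatically.
# Orthographic-lexical configuration: lexical responses instead carry a
# human-annotated orthographic form, and the phonemic form is reconstructed.
@dataclass
class ResponseRecord:
    target_orth: str                    # e.g., "cat"
    response_phonemic: Optional[str]    # None if it must be reconstructed
    response_orth: Optional[str]        # None if it must be reconstructed

# Phonemic-only configuration: only the phonemic transcription is human-annotated.
phonemic_only = ResponseRecord(target_orth="cat",
                               response_phonemic="h ae t",
                               response_orth=None)

# Orthographic-lexical configuration: the response was judged lexical, so the
# orthographic form is retained and the phonemic form will be reconstructed.
orthographic_lexical = ResponseRecord(target_orth="cat",
                                      response_phonemic=None,
                                      response_orth="hat")
```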
ParAlg Decision Tree Algorithm
After each transcription configuration was specified, target–response pairs were classified as one of six cardinal paraphasia codes, as defined under the conventional coding scheme in the PNT scoring manual and briefly reiterated here: (a) Abstruse Neologism, or a nonlexical response bearing no phonological relationship to the target; (b) Formal, or a lexical response bearing phonological but not semantic similarity to the target; (c) Mixed, or a lexical response bearing phonological and semantic similarity to the target; (d) Neologism, or a nonlexical response bearing phonological similarity to the target; (e) Semantic, or a lexical response bearing semantic or select morphological similarity to the target; and (f) Unrelated, or a lexical response bearing neither semantic nor phonological similarity to the target. As is evident in the aforementioned definitions, semantic similarity has no bearing on the paraphasia code assignment of nonlexical responses.
In our updated version of ParAlg, classification takes the form of a multinomial decision tree that replicates the scoring protocol of the PNT, as illustrated in Figure 2. Bifurcations in the tree are determined by the three binary classifiers from the preliminary version of ParAlg (Fergadiotis et al., 2016). Of note, Figure 2 depicts the phonemic-only configuration, as this has been, across all versions, the default transcription configuration of ParAlg. Ways in which classification differs when using the orthographic-lexical configuration, the primary manipulation of ParAlg in this study, are outlined below.
Figure 2.

ParAlg paraphasia classification pipeline. Each bifurcation in the multinomial classification decision tree represents a binary linguistic feature assignment. Thresholding for the lexicality and semantic similarity classifiers is probabilistic and data set specific, as detailed by Fergadiotis et al. (2016). Phonological similarity thresholding is deterministic. MAPPD = Moss Aphasia Psycholinguistics Project Database; LEX = lexicality; SEM = semantic similarity; PHONO = phonological similarity.
Lexicality
First, the lexicality of a response is ascertained (see the first bifurcation of the multinomial processing tree in Figure 2). Unlike subsequent bifurcations, this process differs between transcription configurations. For the phonemic-only configuration, lexicality classification, the first bifurcation in ParAlg's multinomial decision tree, is an automated probabilistic process where a frequency count, as derived from a large American English corpus (SUBTLEXus; Brysbaert & New, 2009), is first assigned and then compared to a predetermined threshold (see McKinney-Bock & Bedrick, 2019, for further details). For the orthographic-lexical configuration, lexicality is deterministically assigned based on the presence of a response orthographic transcription. Specifically, responses possessing an orthographic transcription are lexical; responses without an orthographic transcription are nonlexical.
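A minimal sketch of the two lexicality routes is shown below, assuming a word-level frequency lookup. The frequency counts and threshold are placeholders rather than the actual SUBTLEXus values or ParAlg's tuned threshold.

```python
from typing import Optional

# Illustrative placeholders; ParAlg derives counts from SUBTLEXus and compares
# them to an empirically determined threshold (McKinney-Bock & Bedrick, 2019).
SUBTLEX_COUNTS = {"cat": 9331, "hat": 5422}   # hypothetical corpus frequencies
LEXICALITY_THRESHOLD = 1                       # hypothetical cutoff

def is_lexical_phonemic_only(candidate_word: str) -> bool:
    """Phonemic-only configuration: frequency-based judgment, simplified here
    to a threshold on corpus frequency."""
    return SUBTLEX_COUNTS.get(candidate_word, 0) >= LEXICALITY_THRESHOLD

def is_lexical_orthographic(response_orth: Optional[str]) -> bool:
    """Orthographic-lexical configuration: deterministic; a response is lexical
    if and only if it carries a human-annotated orthographic transcription."""
    return response_orth is not None

print(is_lexical_phonemic_only("blil"))   # False: not in the frequency table
print(is_lexical_orthographic("hat"))     # True
```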
Semantic Similarity
Next, if the response is lexical, semantic similarity between the response and the intended target is determined (see the second bifurcation in Figure 2). For this feature, a neural network model of word meanings, word2vec (Mikolov et al., 2013), is first trained on a corpus (see Fergadiotis et al., 2016; McKinney-Bock & Bedrick, 2019, for development details). Then, the abstract representations or embeddings of each word in the model (as calculated in Mikolov et al., 2013) are obtained, and the semantic relationship between the word roots of the target–response pair in MAPPD-12K is represented numerically as the cosine similarity between the word embeddings for the target and the response. As with lexicality in the phonemic-only configuration, a precomputed similarity threshold (see Fergadiotis et al., 2016, for procedural details) was used to dichotomize pairs as semantically related or not.
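A minimal sketch of the semantic similarity judgment follows, assuming the target and response embeddings are already available as NumPy vectors; the threshold shown is a placeholder, not ParAlg's precomputed value.

```python
import numpy as np

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine of the angle between two word embeddings."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

def is_semantically_similar(target_vec: np.ndarray, response_vec: np.ndarray,
                            threshold: float = 0.4) -> bool:
    """Dichotomize a target-response pair as semantically related or not.
    The 0.4 default is an arbitrary placeholder; ParAlg applies a precomputed,
    data set-specific threshold (Fergadiotis et al., 2016)."""
    return cosine_similarity(target_vec, response_vec) >= threshold

# Toy 3-dimensional "embeddings" for illustration only; real word2vec vectors
# typically have several hundred dimensions.
cat_vec = np.array([0.9, 0.1, 0.3])
dog_vec = np.array([0.8, 0.2, 0.4])
print(is_semantically_similar(cat_vec, dog_vec))   # True for these toy vectors
```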
Phonological Similarity
Third, phonological similarity between the target–response pairs is derived (see the third bifurcation in Figure 2). In accordance with the PNT scoring rules, phonological similarity is defined as (a) two or more matching phonemes in any position, (b) matching phoneme in the vowel carrying primary stress, (c) matching phoneme in the word-final position, (d) matching phoneme in the word-initial position, or (e) one or more matching phonemes with a shared word and syllable position (onset, vowel, or coda) when aligned left to right. Notably, phonemes within a consonant cluster in a target and/or a response are conceptualized as sharing the same word position, regardless of their order (see Figure 2 for an example). These rules are operationalized in ParAlg as a series of Boolean logic operations. Specifically, response and target phonemic transcriptions, enhanced with additional phonetic information (i.e., stress and syllabification), are assigned a "true" or "false" value for each of the five conditions, where one or more values of "true" flag the target–response pair as phonologically similar.
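A simplified sketch of the first four criteria is shown below, assuming phonemes are given as lists of strings and the primary-stressed vowel is identified by index. The fifth criterion (shared left-to-right word and syllable position) and the consonant-cluster convention are omitted for brevity; ParAlg's full rule set handles both.

```python
from collections import Counter

def shares_two_phonemes(target: list, response: list) -> bool:
    """Criterion (a): two or more matching phonemes in any position."""
    overlap = Counter(target) & Counter(response)   # multiset intersection
    return sum(overlap.values()) >= 2

def is_phonologically_similar(target: list, response: list,
                              target_stress_idx: int, response_stress_idx: int) -> bool:
    """Simplified check of PNT phonological similarity criteria (a)-(d).
    Criterion (e), shared left-to-right word and syllable position, is omitted."""
    checks = (
        shares_two_phonemes(target, response),                        # (a)
        target[target_stress_idx] == response[response_stress_idx],   # (b) stressed vowel
        target[-1] == response[-1],                                   # (c) final phoneme
        target[0] == response[0],                                     # (d) initial phoneme
    )
    return any(checks)

# Example: "hat" /h ae t/ vs. "cat" /k ae t/ share the stressed vowel and the
# final phoneme, so the pair is flagged as phonologically similar.
print(is_phonologically_similar(["h", "ae", "t"], ["k", "ae", "t"], 1, 1))  # True
```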
Analyses
Paraphasia codes and feature assignment across the three binary classifiers were generated for all responses in MAPPD-12K using ParAlg under both the phonemic-only and orthographic-lexical configurations. See Supplemental Materials S1, S2, and S3 for a description of the MAPPD-12K data set, the raw input file, and a summary output file with the item-level qualitative coding, respectively.
Agreement Between ParAlg and Human Annotators
Agreement between the human-annotated and algorithm-generated paraphasia codes across the two transcription configurations was evaluated in two ways. First, confusion matrices, which contrast counts of agreement between the two classification systems on the diagonal with counts of disagreements on the off-diagonal, were visually inspected. Second, four metrics common to assessing concordance between behavioral measures were calculated: (a) positive predictive value, (b) sensitivity, (c) specificity, and (d) F1 score. Within a computer science framework, each metric, as described below, can be understood as evaluating the extent to which the information retrieved (i.e., the algorithm's predictions) aligns with the ground truth of information relevant to the question at hand. Here, the paraphasia codes generated by ParAlg constitute the retrieved information, whereas the paraphasia codes classified by human annotators are the ground truth relevant information (i.e., the gold standard measure). A graphical depiction of the relationships among the various metrics is shown in Figure 3.
Figure 3.
Agreement metrics. Each agreement metric is derived from four values: true positives, false positives, false negatives, and true negatives. The F1 score, which is the primary metric of this study, represents the harmonic mean of two of the other metrics: positive predictive value and sensitivity. The bottom-right box of the figure shows the geometric interpretation of the harmonic mean (HM; distance between points C and D), relative to the arithmetic mean or "average" (AM; distance between points A and B) and geometric mean (GM; distance between points D and E), the latter of which is calculated by multiplying all values and taking the square root of the resultant product. In the special case of calculating a mean between two numbers (in this case, positive predictive value and sensitivity), the relationship between the harmonic mean and the other two can be expressed as HM = GM²/AM. In other words, the harmonic mean attenuates the arithmetic mean with the geometric mean, which accounts for compounding effects (in this case, the interdependence between positive predictive value and sensitivity) and penalizes outlier values (in this case, any potential imbalances between positive predictive value and sensitivity). Consequently, the harmonic mean value is always lower than the geometric mean, which in turn is always lower than the arithmetic mean. PPV = positive predictive value; Sens. = sensitivity; Spec. = specificity.
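As a concrete illustration of the first evaluation step described above, a confusion matrix can be tabulated directly from the paired code lists; the per-code true/false positive and negative counts that feed the four metrics are then read off its rows and columns. The sketch below assumes scikit-learn is available and uses a handful of hypothetical codes rather than data from this study.

```python
from sklearn.metrics import confusion_matrix

CODES = ["Abstruse Neologism", "Formal", "Mixed", "Neologism", "Semantic", "Unrelated"]

# Hypothetical paired codes: human annotations (ground truth) vs. ParAlg output.
human_codes  = ["Formal", "Semantic", "Neologism", "Mixed"]
paralg_codes = ["Formal", "Unrelated", "Neologism", "Mixed"]

# Rows index the human-annotated codes, columns the ParAlg-generated codes;
# counts on the diagonal reflect agreement.
matrix = confusion_matrix(human_codes, paralg_codes, labels=CODES)
print(matrix)
```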
All metrics range from 0 to 1.0, with 0 representing absent agreement and 1.0 representing perfect agreement relative to the gold standard measure. Each of the four metrics was calculated individually for each of the six paraphasia types to quantify agreement for a single code relative to the other five codes. An average, both unweighted and weighted, was also computed for each of the four metrics across the six paraphasia types to quantify overall agreement. Calculating both unweighted and weighted averages provides a quantitative comparison of the influence of uneven prevalence across the classes (i.e., paraphasia types) of interest. As an example, the unweighted-average F1 score was calculated by summing the F1 scores for each of the six paraphasia types and dividing that sum by six (the number of types). The weighted-average F1 score, by contrast, was calculated by multiplying those same six F1 scores by their respective numbers of observations, summing the products, and then dividing that sum by the total number of observations. In other words, the two averages differ in whether the metric for a given paraphasia type is weighted by its number of observations: the unweighted average treats all types equally, whereas the weighted average reflects how often each type occurs in the data set.
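To make the distinction concrete, the short calculation below reproduces (to two decimals) the unweighted (.73) and weighted (.78) average F1 scores for the phonemic-only configuration, using the per-code F1 scores and observation counts reported later in Table 3.

```python
# Per-code F1 scores and observation counts for the phonemic-only configuration
# (values taken from Table 3 of this article).
f1_by_code = {"Abstruse Neologism": 0.75, "Formal": 0.75, "Mixed": 0.63,
              "Neologism": 0.89, "Semantic": 0.75, "Unrelated": 0.63}
n_by_code  = {"Abstruse Neologism": 1000, "Formal": 2471, "Mixed": 1198,
              "Neologism": 4450, "Semantic": 2031, "Unrelated": 849}

# Unweighted average: every paraphasia type counts equally.
unweighted = sum(f1_by_code.values()) / len(f1_by_code)

# Weighted average: each type's F1 is weighted by its number of observations,
# and the sum of products is divided by the total number of observations.
total_n = sum(n_by_code.values())
weighted = sum(f1_by_code[code] * n_by_code[code] for code in f1_by_code) / total_n

print(round(unweighted, 2), round(weighted, 2))   # 0.73 0.78
```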
Positive predictive value. Also referred to as precision, positive predictive value measures the proportion of ParAlg-generated paraphasia codes that were correct (i.e., instances in which ParAlg's prediction agreed with the human-annotated paraphasia code). As shown in Figure 3, it is calculated as the number of true positives (i.e., cases where ParAlg-generated and human-annotated paraphasia codes agree) over the sum of true positives and false positives (i.e., the denominator represents all cases where ParAlg assigned that particular code). Positive predictive value is commonly used in medical fields as a metric for evaluating the diagnostic utility of a test to screen disease (e.g., Fletcher, 2019) and has been more recently adapted to the field of computer science in evaluating the proportion of relevant information retrieved by a system (e.g., Tharwat, 2021).
Sensitivity. Also known as the true-positive rate, sensitivity is the proportion of instances that a given human-annotated paraphasia code was identified as such by ParAlg. As shown in Figure 3, it is calculated as the number of true positives over the sum of true positives and false negatives (i.e., the denominator represents all cases where the human annotator assigned that particular code).
Specificity. In contrast to sensitivity, specificity represents the proportion of instances for which ParAlg will agree with the human annotator when a response does not belong to a given paraphasia category. As shown in Figure 3, it is calculated as the number of true negatives (i.e., cases where ParAlg-generated and human-annotated codes agreed that a given paraphasia code was one of the five other possible codes) over the total number of human-annotated paraphasia codes in the other categories.
F1 score. An overall metric of agreement, the F1 score combines the positive predictive value and sensitivity metrics into a single numerical value by taking their harmonic mean, as shown in Figure 3. The necessity for combining the two metrics comes from limitations in their individual calculations. Specifically, the formula for positive predictive value does not include false negatives, whereas the formula for sensitivity does not include false positives. The functional consequence of these calculations is that classification models with high positive predictive value may nonetheless fail to retrieve relevant information, whereas models with high sensitivity may retrieve information that is not relevant. The F1 score, as such, provides information about both the extent to which a classification model retrieves information and the relevancy of that information.
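The four metrics described above reduce to a few lines of arithmetic. The sketch below computes them from raw counts for a single paraphasia code treated one-versus-rest; the counts in the example are hypothetical.

```python
def agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four agreement metrics for one paraphasia code (one-vs-rest).
    tp: ParAlg and the human annotator both assigned the code.
    fp: ParAlg assigned the code but the human annotator did not.
    fn: the human annotator assigned the code but ParAlg did not.
    tn: neither assigned the code."""
    ppv = tp / (tp + fp)                       # positive predictive value (precision)
    sensitivity = tp / (tp + fn)               # true-positive rate (recall)
    specificity = tn / (tn + fp)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)   # harmonic mean of PPV and sensitivity
    return {"ppv": ppv, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

# Hypothetical counts for one code.
print(agreement_metrics(tp=90, fp=10, fn=20, tn=880))
```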
The F1 score is a well-established and widely used measure of classifier performance in information retrieval and machine learning (Lipton et al., 2014; Manning et al., 2008; van Rijsbergen, 1979). Although not commonly applied to behavioral measures in aphasiology, F1 scores have been used in quantifying neuroimaging data (often referred to in that context as a Dice coefficient), such as validating novel methods for lesion–symptom mapping (e.g., Zhang et al., 2014) or evaluating concordance of language activation maps from functional magnetic resonance imaging across scanning sessions (e.g., Wilson et al., 2018). Although far from perfect (see Chicco & Jurman, 2020), the F1 score serves as a useful benchmark in this analysis for quickly comparing the performance of ParAlg across paraphasia types and between configurations.
Agreement Between the Two Transcription Configurations
Coding differences on the part of ParAlg between the two transcription configurations were judged qualitatively via comparison of F1 scores, as mentioned above, and visual inspection using an alluvial plot (Brunson, 2020; Rosvall & Bergstrom, 2010).
Item-Level Analysis of the Best Performing Configuration
In an effort to better assess ParAlg's performance, an item-level discrepancy analysis was conducted at the level of the binary judgment or linguistic feature (i.e., phonological similarity, semantic similarity) unless there was perfect agreement between ParAlg and human annotators, as was the case for lexicality (see the Results section for more details). In other words, we reviewed all target–response pairs where the human-annotated and algorithm-generated paraphasia codes disagreed with regard to linguistic features. Based on our initial debugging efforts, review of items at the linguistic feature level, as opposed to the paraphasia code level, yielded more information about the locus of breakdown in either our coding machinery or the human annotation process.
In the case of human annotators, linguistic features were derived from the original paraphasia code by subdividing the code along the same three dimensions of ParAlg's classifiers, given that our automated system is designed to emulate the same types of judgments necessary to generating a paraphasia code according to the PNT scoring guidelines. For example, a Formal paraphasia was re-expressed as + lexicality, + phonological similarity, and – semantic similarity. For ParAlg, these linguistic feature assignments were extracted from the raw data file outputted from ParAlg (see Supplemental Material S3 for more detail).
This analysis was conducted only on the best performing configuration (i.e., orthographic-lexical configuration, as detailed in the Results section), in line with the overall aim of using this configuration in future ParAlg development efforts. Three research assistants served as raters. Interrater agreement was quantified for fidelity purposes using a variation of the kappa statistic for fully crossed data with three or more raters (Light, 1971) and interpreted qualitatively following the work of Landis and Koch (1977).
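Under the common formulation of Light's (1971) statistic as the mean of all pairwise Cohen's kappas, the interrater agreement computation can be sketched as follows. Whether this exact variant matches the one used here is an assumption, and the ratings shown are hypothetical; scikit-learn is assumed to be available.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def lights_kappa(ratings_by_rater: list) -> float:
    """Average pairwise Cohen's kappa across all rater pairs (Light, 1971),
    assuming every rater judged every item (fully crossed design)."""
    pairs = combinations(range(len(ratings_by_rater)), 2)
    kappas = [cohen_kappa_score(ratings_by_rater[i], ratings_by_rater[j])
              for i, j in pairs]
    return sum(kappas) / len(kappas)

# Hypothetical binary phonological-similarity judgments from three raters.
rater_1 = [1, 1, 0, 1, 0, 0, 1, 0]
rater_2 = [1, 0, 0, 1, 0, 1, 1, 0]
rater_3 = [1, 1, 0, 1, 0, 0, 1, 1]
print(round(lights_kappa([rater_1, rater_2, rater_3]), 2))
```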
Phonological similarity discrepancy analysis. First, each rater was provided a data file of unique target–response pairs in MAPPD-12K (n = 9,280), which was operationalized as pairs with the same target orthographic transcription, the same response orthographic transcription, the same response phonemic transcription, and the same human-annotated paraphasia code (see Supplemental Material S1 for more detail). This file was edited to contain only the available transcriptions, both phonemic and orthographic, for the targets and responses. Raters were instructed to (1) determine whether a phonological relationship was present among the pairs and, (2) if a phonological relationship was present, identify the type(s) of relationship(s) according to the PNT scoring guidelines, which, as described above, were (a) shared first phoneme, (b) shared final phoneme, (c) shared stressed vowel, (d) two or more shared phonemes in any position, and (e) one or more shared phonemes in the same left-to-right word and syllable position.
Next, all of the rated pairs were cross-referenced against the resultant phonological similarity discrepancies from the best performing configuration of ParAlg, and a subset of unique target–response pairs containing only these discrepancies was retained. Next, ratings were used in conjunction with the linguistic feature of interest (i.e., phonological similarity), as derived from both MAPPD human annotators and ParAlg, to categorize and subcategorize all unique discrepancy pairs. Categorization and subcategorization were completed by the first two authors (M.C. and G.F.). Categorization was operationalized as follows: (a) human coding error, where all three raters and the algorithm-generated feature assignment agreed with regard to the presence or absence of a phonological relationship while the extracted linguistic feature from the MAPPD human-annotated paraphasia code did not, and (b) ParAlg coding error, where all three of the raters' judgments and the extracted linguistic feature from the MAPPD human-annotated paraphasia code agreed with regard to the presence or absence of a phonological relationship while the algorithm-generated feature assignment did not. Responses where the three raters disagreed with regard to the phonological relationship were categorized as uncertain. Subcategorization was operationalized as each of the five scoring criteria outlined above and consisted of pairs where all three raters agreed on both the category and subcategory.
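The categorization logic just described can be expressed as a small decision function; the function and variable names below are ours, and the sketch covers only the basic agreement pattern, not the full review workflow.

```python
def categorize_discrepancy(rater_judgments: list,
                           mappd_feature: bool,
                           paralg_feature: bool) -> str:
    """Categorize a feature-level discrepancy between the MAPPD human-annotated
    feature and the ParAlg-generated feature, following the decision rules
    described in the text. Names are illustrative, not ParAlg's."""
    # If the three raters do not all agree, the discrepancy is uncertain.
    if len(set(rater_judgments)) != 1:
        return "uncertain"
    consensus = rater_judgments[0]
    # Raters and ParAlg agree, but the MAPPD annotation does not: human coding error.
    if consensus == paralg_feature and consensus != mappd_feature:
        return "human coding error"
    # Raters and the MAPPD annotation agree, but ParAlg does not: ParAlg coding error.
    if consensus == mappd_feature and consensus != paralg_feature:
        return "ParAlg coding error"
    return "uncertain"

# Example: all three raters judge the pair phonologically similar, ParAlg agrees,
# but the MAPPD code implies no similarity -> human coding error.
print(categorize_discrepancy([True, True, True],
                             mappd_feature=False, paralg_feature=True))
```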
A second-level subcategorization was completed on pairs categorized to be ParAlg coding errors, in an effort to better debug the algorithm. This involved the first author (M.C.) reviewing all available data from the MAPPD-12K input file and ParAlg output file and identifying whether the misclassification was due to one of six subcategories previously identified as potential algorithm bugs: (a) reconstructing a response phonemic transcription, (b) assigning stress and syllabification, (c) stripping the plural suffix, (d) parsing rhotic vowels into their constituents, (e) including word-final vowels in linguistic feature assignment, and (f) overidentifying shared phonemes due to the presence of alternative phonemic transcriptions for the target and/or response. All other cases were subcategorized as “other.” For further details on how these subcategories are integrated into ParAlg's classification system, see the Appendix.
Semantic similarity discrepancy analysis. Following a similar approach to the phonological similarity discrepancy analysis, raters were again provided a data file of unique target–response pairs in MAPPD-12K. Given that semantic similarity in conventional PNT scoring criteria is restricted to lexical target–response pairs, only unique pairs coded as + lexical by the MAPPD human annotators (n = 4,446) were included for this item-level analysis. As before, the MAPPD-12K data file was edited; however, here, only orthographic transcriptions of both the target and the response were provided, as this judgment was made by both human annotators and ParAlg using orthography exclusively. Raters received instructions analogous to those in the phonological similarity discrepancy analysis: to judge the presence of a semantic relationship and then, if present, identify the type of relationship. With regard to the latter, relationship types were based on those listed in the PNT scoring manual and included the following: (a) synonym, (b) category coordinate, (c) superordinate, (d) subordinate, (e) associated, (f) diminutive, (g) semantically related proper name, and (h) shared morpheme, defined as the addition of a lemma to a monomorphemic target or the addition/substitution of a lemma in a compound target. Unlike in the phonological analysis, raters were not permitted to assign more than one relationship type to a given target–response pair, as our initial item-level efforts revealed that only a marginal number of cases clearly had multiple types of semantic relationships.
As before, rated pairs were then cross-referenced against the best performing ParAlg configuration, and only a subset of unique target–response pairs where semantic similarity assignment was discrepant between MAPPD human annotators and ParAlg was retained. Categorization and subcategorization of rater judgments again followed those in the phonological similarity discrepancy analysis, although here, no second-level subcategorization was completed, as ParAlg's semantic similarity classification machinery is not rule based and, as such, cannot be debugged in a manner comparable to phonological similarity classification.
Results
Agreement Between ParAlg and Human Annotators
Agreement metrics for the two transcription configurations, both within and across paraphasia types, are displayed in Table 3. Across paraphasia types, both average and weighted-average F1 scores were greater than .70 for both configurations, although the latter, which accounts for uneven prevalence across classes, was consistently higher. Within paraphasia types, F1 score was highest for Neologism (phonemic-only: .89, orthographic-lexical: .98) and lowest for Mixed (phonemic-only: .63, orthographic-lexical: .70) and Unrelated (phonemic-only: .63, orthographic-lexical: .71). With the exception of Abstruse Neologism, F1 scores generally increased with the prevalence of a given paraphasia type in MAPPD-12K.
Table 3.
Evaluation metrics comparing agreement between human-annotated and algorithm-generated paraphasia codes.
| Paraphasia code | PPV Phon. | PPV Ortho. | Sensitiv. Phon. | Sensitiv. Ortho. | Specif. Phon. | Specif. Ortho. | F1 Phon. | F1 Ortho. | No. Obs. a |
|---|---|---|---|---|---|---|---|---|---|
| Abstruse Neologism | .70 | .91 | .80 | .91 | .97 | .99 | .75 | .91 | 1,000 |
| Formal | .79 | .91 | .71 | .80 | .95 | .98 | .75 | .85 | 2,471 |
| Mixed | .55 | .60 | .72 | .85 | .94 | .94 | .63 | .70 | 1,198 |
| Neologism | .89 | .98 | .90 | .98 | .94 | .99 | .89 | .98 | 4,450 |
| Semantic | .86 | .88 | .67 | .75 | .98 | .98 | .75 | .81 | 2,031 |
| Unrelated | .59 | .67 | .68 | .75 | .96 | .97 | .63 | .71 | 849 |
| Avg./Total b | .73 | .82 | .75 | .84 | .96 | .97 | .73 | .83 | 11,999 |
| Wt. Avg./Total c | .79 | .88 | .78 | .87 | .95 | .98 | .78 | .87 | 11,999 |
Note. PPV = positive predictive value; Sensitiv. = sensitivity; Specif. = specificity; Phon. = phonemic-only transcription configuration; Ortho. = orthographic-lexical transcription configuration.
a The number of observations, or prevalence, for a given paraphasia, as coded by human annotators.
b The average score of a given metric across the six paraphasia codes.
c The average score, weighted by prevalence, across the six paraphasia codes.
Phonemic-Only Configuration
A confusion matrix stratified by human-annotated and algorithm-generated paraphasia codes revealed that the majority of responses were in agreement, as seen by comparing the values along the diagonal relative to the total values along the margins (see Table 4). Misclassification primarily occurred due to shifts in the assignment of semantic similarity, followed by lexicality. With regard to the former, ParAlg assigned 445 responses a Mixed code, which is + semantic, when the human annotator assigned Formal, which is – semantic. Moreover, 229 responses were labeled as Unrelated by ParAlg, requiring a – semantic feature assignment, when the human annotator labeled these as Semantic, denoting a + semantic feature assignment. With regard to the latter, misclassifications were most notable for Formal and Neologism paraphasias. Specifically, there were 312 cases where ParAlg assigned the code of Formal when the human annotator assigned the nonlexical analogue, Neologism. In the reverse, 189 target–response pairs were classified by ParAlg as Neologism when the human annotator assigned a Formal code.
Table 4.
Confusion matrix for the phonemic-only configuration.
| Human annotators | ParAlg: Abstruse Neologism | ParAlg: Formal | ParAlg: Mixed | ParAlg: Neologism | ParAlg: Semantic | ParAlg: Unrelated | All |
|---|---|---|---|---|---|---|---|
| Abstruse Neologism | 804 | 6 | 0 | 79 | 6 | 105 | 1,000 |
| Formal | 7 | 1,764 | 445 | 189 | 16 | 50 | 2,471 |
| Mixed | 18 | 101 | 865 | 149 | 56 | 9 | 1,198 |
| Neologism | 91 | 312 | 51 | 3,987 | 1 | 8 | 4,450 |
| Semantic | 171 | 20 | 194 | 63 | 1,354 | 229 | 2,031 |
| Unrelated | 63 | 39 | 11 | 6 | 151 | 579 | 849 |
| All | 1,154 | 2,242 | 1,566 | 4,473 | 1,584 | 980 | 11,999 |
Note. Cells on the diagonal reflect agreement.
The agreement metrics (see Table 3, “Phon.” columns) mirrored observations from the confusion matrix. Regarding overall performance, average and weighted-average F1 scores were .73 and .78, respectively, with the latter score accounting for the uneven prevalence across the paraphasia types (ranging from 849 to 4,450 observations). F1 scores for specific paraphasia codes fell predominantly in the .63–.75 range; the only score above .80 was for Neologism (.89). Sensitivity demonstrated a similar pattern, with two paraphasia types (Abstruse Neologism and Neologism) at or above .80 and the remaining in the .67–.72 range. Specificity was high, with values greater than .90 across all six paraphasia types.
Orthographic-Lexical Configuration
As with the phonemic-only configuration, a confusion matrix showed that the majority of responses were in agreement (see Table 5). Here, misclassification primarily occurred due to shifts in the assignment of semantic similarity, followed by phonological similarity. For instance, in the case of the former, ParAlg assigned 441 responses a Mixed code, which is + semantic, when the human annotator assigned Formal, which is – semantic. Similarly, ParAlg assigned 257 responses an Unrelated code, which is – semantic, when the human annotator assigned Semantic, which is + semantic. In the case of the latter, there were 92 instances where ParAlg assigned Abstruse Neologism (− phonology) when the human annotator assigned Neologism (+ phonology) and 86 responses where the opposite occurred.
Table 5.
Confusion matrix for the orthographic-lexical configuration.
| Human annotators | ParAlg: Abstruse Neologism | ParAlg: Formal | ParAlg: Mixed | ParAlg: Neologism | ParAlg: Semantic | ParAlg: Unrelated | All |
|---|---|---|---|---|---|---|---|
| Abstruse Neologism | 914 | 0 | 0 | 86 | 0 | 0 | 1,000 |
| Formal | 0 | 1,970 | 441 | 0 | 12 | 48 | 2,471 |
| Mixed | 0 | 125 | 1,019 | 0 | 47 | 7 | 1,198 |
| Neologism | 92 | 0 | 0 | 4,358 | 0 | 0 | 4,450 |
| Semantic | 0 | 26 | 232 | 0 | 1,516 | 257 | 2,031 |
| Unrelated | 0 | 43 | 13 | 0 | 155 | 638 | 849 |
| All | 1,006 | 2,164 | 1,705 | 4,444 | 1,730 | 950 | 11,999 |
Note. Cells on the diagonal reflect agreement.
Average and weighted-average F1 scores across all six paraphasia types for the orthographic-lexical configuration were .83 and .87, respectively (see Table 3, “Ortho.” columns). F1 scores for paraphasia code–specific performance were greater than .90 for Abstruse Neologism and Neologism, greater than .80 for Formal and Semantic, and .70 or greater for Mixed and Unrelated. Sensitivity was also high across paraphasia types, with two paraphasia types (Abstruse Neologism and Neologism) above .90, two (Formal and Mixed) at or above .80, and the remaining above .70. Specificity values were greater than .90 for all paraphasia types.
Agreement Between the Two Transcription Configurations
Although both transcription configurations demonstrated moderate-to-high agreement with human-annotated paraphasia codes, the orthographic-lexical configuration consistently outperformed the phonemic-only configuration (see Table 3). Specifically, overall classification performance on the part of ParAlg, measured as the average and weighted-average F1 scores, was .10 and .09 higher, respectively, for the orthographic-lexical configuration. Moreover, F1 scores and sensitivity for all six paraphasia types were .06–.16 higher for the orthographic-lexical configuration as compared with the phonemic-only configuration.
An alluvial plot (see Figure 4) also shows less frequent divergence of algorithm-generated paraphasia codes, as compared to human-annotated codes, in the orthographic-lexical configuration. Of the 11,999 target–response pairs in MAPPD-12K, 1,410 paraphasia codes were different across the two transcription configurations. Of those, 1,127 were in agreement with the human-annotated codes for the orthographic-lexical configuration but not the phonemic-only configuration, and 65 displayed the opposite pattern. The remaining 218 were misaligned with the human-annotated codes for both configurations.
Figure 4.

Paraphasia classification differences between the two transcription configurations. The y-axis depicts the number of the paraphasia code differences (n = 1,410) within the six possible paraphasia types and their change across the two different output conditions (leftmost and rightmost columns), as compared with the human-annotated codes (center column) along the x-axis. Paraphasia code agreements are not plotted. A.Neologism = Abstruse Neologism; Phonemic = paraphasia codes from ParAlg's phonemic-only configuration; Orthographic = paraphasia codes from ParAlg's orthographic-lexical configuration; Human = paraphasia codes from human annotators in MAPPD-12K.
Item-Level Analysis of the Best Performing Configuration
Given the consistently superior performance of the orthographic-lexical configuration, the item-level analysis was conducted with data from this configuration exclusively.
Phonological Similarity
For phonological feature assignment, there were 606 discrepancies between human annotators and ParAlg (.05 proportion of MAPPD-12K), of which 492 were unique target–response pairs (.05 proportion of 9,280 total unique phonemic pairs). Interrater agreement was moderate for assigning phonological similarity (κ = .42) and ranged from substantial to almost perfect (κ = .71–.88) for assigning the phonological relationship to one or more subcategories.
After accounting for the uncertain category (.44, or 214 paraphasias), where the three raters failed to reach exact agreement, the largest proportion of unique phonological similarity discrepancies was attributable to human coding error (.38, or 180 paraphasias; see Table 6). ParAlg coding errors constituted the smallest proportion of the misclassified responses (.20, or 98 paraphasias). When further subcategorized, the most common subcategory for both human- and ParAlg-related coding errors was no relationship (human: .38, or 69 paraphasias; ParAlg: .50, or 49 paraphasias), that is, the assignment of a phonological relationship against the majority consensus, followed by a failure to identify (a) shared syllable and word position (human: .38, or 68 paraphasias; ParAlg: .34, or 33 paraphasias) and (b) shared primary stress (human: .25, or 45 paraphasias; ParAlg: .24, or 24 paraphasias). The second-level subcategorization revealed that almost half of the 98 ParAlg-related errors were attributable to inaccurate stress or syllabification assignment (.42, or 41 paraphasias), although 22 of these were nonlexical responses and ultimately ambiguous in nature given the lack of a ground truth (i.e., dictionary) for reference. Reconstruction of the response phonemic transcription also accounted for a substantial number of discrepancies (.21, or 21 paraphasias), as did vowel-related errors (.11, or 11 paraphasias). The remaining subcategories were relatively evenly represented among the remaining 25 target–response pairs.
Table 6.
Subcategorization of phonological similarity classification discrepancies.
| Subcategory | Human error (n = 180): Proportion | Human error: Count | ParAlg error (n = 98): Proportion | ParAlg error: Count |
|---|---|---|---|---|
| Final phoneme | .20 | 35 | .08 | 8 |
| First phoneme | .13 | 24 | .05 | 5 |
| No relationship | .38 | 69 | .50 | 49 |
| Stressed vowel | .25 | 45 | .24 | 24 |
| Syllable/word position | .38 | 68 | .34 | 33 |
| Two phonemes | .18 | 33 | .14 | 14 |
Note. An additional 214 cases were categorized as uncertain, where the three rater judgments were not in exact agreement with regard to the presence or absence of phonological similarity. All subcategories except “No relationship” were permitted to overlap. Rater disagreement for subcategorization was not quantified due to overlap among the subcategories.
Semantic Similarity
For semantic similarity, there were 1,036 semantic discrepancies in total (.09 proportion of MAPPD-12K), of which 811 were unique target–response pairs (.09 proportion of 9,280 total unique pairs). Interrater agreement was moderate for both assigning semantic similarity (κ = .60) and for subcategorizing the primary type of semantic relationship (κ = .55).
In contrast to the phonological discrepancies, the majority of target–response pairs with semantic misclassification (.63, or 511 paraphasias) were due to a ParAlg coding error (see Table 7). Human coding errors constituted a fraction (.08, or 63 paraphasias) of the misclassified unique responses, whereas the remainder (.29, or 235 paraphasias) were categorized as uncertain due to a lack of exact agreement among the three raters. When further subcategorized, similar patterns arose for both the human and ParAlg coding errors but with differing magnitudes. False positives, or identification of semantic similarity against the majority consensus, constituted the largest share of misclassifications for both human annotators (.38, or 24 paraphasias) and ParAlg (.69, or 352 paraphasias). Associated target–response pairs yielded the second most frequent misclassifications (human: .30, or 19 paraphasias; ParAlg: .12, or 62 paraphasias). Superordinate, subordinate, category coordinate, synonym, diminutive, and morphological misclassifications were rare or absent for both human annotators and ParAlg.
Table 7.
Subcategorization of semantic similarity classification discrepancies.
| Subcategory | Human error (n = 63): Proportion | Human error: Count | ParAlg error (n = 511): Proportion | ParAlg error: Count |
|---|---|---|---|---|
| Associated | .30 | 19 | .12 | 62 |
| Category coordinate | .03 | 2 | .04 | 20 |
| Diminutive | .00 | 0 | .00 | 1 |
| No relationship | .38 | 24 | .69 | 352 |
| Rater disagreement a | .19 | 12 | .10 | 52 |
| Related proper name | .00 | 0 | .00 | 0 |
| Shared morpheme | .08 | 5 | .01 | 3 |
| Subordinate | .00 | 0 | .01 | 6 |
| Superordinate | .00 | 0 | .03 | 13 |
| Synonym | .00 | 0 | .00 | 2 |
Note. An additional 235 cases were categorized as uncertain, where the three rater judgments were not in exact agreement with regard to the presence or absence of semantic similarity. There were two cases of data missing at random for rater judgments of semantic similarity and semantic subcategorization. There were another five cases of data missing at random for semantic subcategorization only, four of which were categorized as uncertain. The remaining case was categorized as a human error; as such, it was included in the numeric total but not listed in the table.
a Cases where the three raters' subcategorization of a human- or ParAlg-related error was not in exact agreement.
Discussion
The primary aim of this study was to quantify the performance of our algorithmic approach to paraphasia classification following key coding updates and under two configurations of transcription input. In line with our prior findings (Fergadiotis et al., 2016), algorithm-generated and human-annotated paraphasia codes demonstrated high agreement across the two configurations and when using a multinomial decision tree system, showing ParAlg to be a robust alternative to manual scoring.
Agreement Between ParAlg and Human Annotators
Overall, our algorithmic approach was highly accurate across the six paraphasia types for both transcription configurations, with average F1 scores in the .70–.80 range. In other words, both positive predictive value and sensitivity metrics were strong across the paraphasia types, as can be observed in Table 3 (final two rows and first four columns). Returning to the concepts of retrieved and relevant information and their importance in appraising a novel algorithm, a high F1 score indicates that ParAlg is not only retrieving information but also retrieving information relevant to the task at hand, suggesting that it acts as an excellent proxy for the human scoring process. Notably, weighted-average F1 scores were higher than unweighted-average scores. Given the uneven prevalence across the six paraphasia categories (range: 849–4,450), the weighted-average F1 scores, which account for this unevenness, likely represent a more accurate and realistic estimate of ParAlg's agreement with human annotators overall.
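To make the averaging explicit, the short sketch below (a minimal illustration using scikit-learn and toy label vectors, not the study data or ParAlg's evaluation code) shows how per-code F1 scores are combined into an unweighted (macro) average versus a prevalence-weighted average.

```python
# Minimal sketch of unweighted vs. weighted F1 averaging with scikit-learn.
# The label vectors below are toy examples, not data from this study.
from sklearn.metrics import f1_score

CODES = ["Semantic", "Formal", "Mixed", "Unrelated", "Neologism", "Abstruse Neologism"]

# Hypothetical human-annotated (y_true) and ParAlg-generated (y_pred) codes.
y_true = ["Semantic", "Formal", "Neologism", "Semantic", "Mixed", "Formal", "Semantic"]
y_pred = ["Semantic", "Formal", "Neologism", "Formal",   "Mixed", "Formal", "Semantic"]

# Per-code F1 (harmonic mean of positive predictive value and sensitivity).
per_class = f1_score(y_true, y_pred, labels=CODES, average=None, zero_division=0)

# The macro average treats all six codes equally; the weighted average scales
# each code's F1 by its prevalence among the human-annotated codes.
macro_f1 = f1_score(y_true, y_pred, labels=CODES, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, labels=CODES, average="weighted", zero_division=0)

print(dict(zip(CODES, per_class.round(2))))
print(f"unweighted (macro) F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```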
Certain paraphasia types demonstrated greater agreement than others. Specifically, agreement appeared to be at least partially contingent on the prevalence of a given paraphasia type within the data set, with the lower-incidence codes Unrelated and Mixed showing relatively weaker agreement. In both of these cases, positive predictive value was lower than sensitivity; positive predictive value is directly prevalence dependent (Parikh et al., 2008) and, as the lower score, was differentially weighted in the calculation of the F1 score for these two paraphasia codes. As such, it is possible that agreement between manual and automated scoring approaches for these two paraphasia types was artificially attenuated by the characteristics of our sample, and another data set with a greater prevalence of Unrelated and Mixed paraphasias may demonstrate relatively improved agreement.
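The prevalence dependence of positive predictive value (PPV) can be made explicit with a standard identity relating PPV to sensitivity, specificity, and prevalence $p$ (a textbook relationship, not a formula introduced by this study):

$$\mathrm{PPV} = \frac{\mathrm{sensitivity} \cdot p}{\mathrm{sensitivity} \cdot p + (1 - \mathrm{specificity})(1 - p)}$$

Holding sensitivity and specificity constant, a smaller prevalence $p$ lowers PPV, which in turn lowers the F1 score for low-prevalence codes such as Unrelated and Mixed.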
Agreement Between the Two Transcription Configurations
Despite high agreement with respect to human-annotated paraphasia codes across both transcription configurations, the orthographic-lexical configuration, in which orthographic transcriptions were provided for real-word responses, consistently outperformed the phonemic-only configuration, in which phonemic transcriptions were used exclusively, irrespective of lexical status.
Although improved performance was observed across metrics and paraphasia codes, the magnitude of improvement in the orthographic-lexical configuration was greatest for Abstruse Neologisms (F1 score increase of .16), Formal (.10), and Neologism (.09). Notably, accurate assignment of lexicality and phonological similarity for all of these paraphasia types is especially sensitive to the phonemic transcription of both targets and responses, as well as to the additional phonetic information (i.e., stress and syllabification) automatically assigned by ParAlg. Specifically, as visually represented in Figure 4, there were 126 cases of between-configuration coding differences where the human-annotated code was Abstruse Neologism, 115 of which were misclassified in the phonemic-only configuration and subsequently corrected in the orthographic-lexical configuration. Similarly, 211 of 266 and 372 of 382 cases where the human-annotated code was Formal and Neologism, respectively, were corrected in the better performing configuration. Although most of the corrected misclassifications were strictly related to lexicality (Abstruse Neologism: .90, or 103 paraphasias; Formal: .71, or 151 paraphasias; Neologism: .84, or 311 paraphasias), corrections along multiple dimensions also occurred. For example, there were 51 cases where the phonemic-only code Mixed was corrected to Neologism. In other words, the incorrect assignment of + lexical led to erroneous semantic analysis and misassignment of + semantic, yielding a Mixed code in place of the Neologism code. These cases reflect the hypothesized multiplicative error effect of using a multinomial decision tree for paraphasia classification.
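To illustrate the multiplicative error effect, the sketch below encodes one reading of a three-feature multinomial decision tree (lexicality, semantic similarity, phonological similarity) mapped onto the six cardinal paraphasia codes; the mapping follows the conventional PNT taxonomy described in this article, but the function and its names are illustrative rather than ParAlg's source code.

```python
# Illustrative three-feature decision tree for paraphasia coding. The
# feature-to-code mapping follows the conventional PNT taxonomy as described
# in the text; it is not ParAlg's actual implementation.
def classify(is_lexical: bool, is_semantic: bool, is_phonological: bool) -> str:
    if is_lexical:
        if is_semantic and is_phonological:
            return "Mixed"
        if is_semantic:
            return "Semantic"
        if is_phonological:
            return "Formal"
        return "Unrelated"
    # Nonlexical (nonword) responses: semantic similarity is not evaluated.
    return "Neologism" if is_phonological else "Abstruse Neologism"

# Multiplicative error effect: a single upstream lexicality error changes which
# downstream judgments apply. Suppose the true response is a phonologically
# related nonword (Neologism), but lexicality is misassigned as +lexical and a
# spurious +semantic judgment follows:
print(classify(is_lexical=False, is_semantic=False, is_phonological=True))  # Neologism
print(classify(is_lexical=True,  is_semantic=True,  is_phonological=True))  # Mixed
```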
The differential improvement of the orthographic-lexical configuration in assigning not only lexicality but also the downstream linguistic judgments of semantic and phonological similarity reflects the fact that this configuration removes two primary sources of error: (a) false-positive neologisms, where the assignment of a word to a string of transcribed phonemes fails to identify a word due to subtle transcription differences or other sources of error, and (b) lexical mismatches, where an inappropriate word is assigned to a string of transcribed phonemes. Removing these sources of error comes at a cost, however, which is the bypassing of one of the three binary classifiers (i.e., lexicality) that constitute ParAlg's multinomial classification system. From a computational perspective, the improvements gained in the orthographic-lexical configuration are not attributable to mechanistic advancements of ParAlg and, as such, are not directly comparable to the phonemic-only configuration. Rather, the orthographic-lexical configuration represents a distinct way in which ParAlg can be used and one that requires a "helping hand" from the human annotator. Keeping the ultimate goal of ParAlg in mind, which is the development of a software program for clinical use, this modest burden of judging lexicality placed on the human annotator yields disproportionately greater benefits, such as the increased efficiency that comes with transcribing in orthography and the decreased error with which ParAlg classifies the response. Moreover, preliminary analysis of an unpublished study investigating the reliability of human annotators in assigning lexicality to responses on the PNT has shown them to be highly reliable (ICC[A,1] = .92, 95% CI [.89, .95]), negating potential concerns regarding interrater variability for this type of human judgment in manual scoring.
Item-Level Analysis of the Best Performing Configuration
Discrepancies between ParAlg-generated and human-annotated paraphasia codes in the orthographic-lexical configuration were marginal (1,584 out of 11,999 total target–response pairs). Of those discrepancies, 1,249 were unique target–response pairs and, when re-expressed as binary linguistic features, 438 and 757 were exclusively due to phonological and semantic similarity misclassification, respectively, with an additional 54 attributable to both features. A qualitative review of these discrepancies showed 243 unique misclassifications to be due to human error, the majority in phonological similarity assignment (n = 180), and 609 unique misclassifications—approximately half of all unique discrepancies—on the part of ParAlg, the majority of which were due to errors in assigning semantic similarity (n = 511).
For phonological feature assignment, erroneous assignment of phonological similarity against majority consensus was the predominant type of misclassification for both human annotators and ParAlg. For the former group, failure to identify a shared stressed vowel or shared word and syllable position was also a substantial source of error. For example, a human annotator failed to identify the shared stressed vowel /æ/ across the target banana and the response "apple," which led to classifying the pair as Semantic rather than Mixed, the code assigned by ParAlg. This may reflect a relatively greater cognitive demand for identifying certain types of phonemic features, particularly if coded directly from a live or audio-recorded administration as opposed to a phonemic transcription. For ParAlg, similar patterns, albeit at smaller magnitudes, were observed within the subcategories. However, when rereviewed during the second-level analysis, several of these errors were found to be attributable to automated stress and syllabification assignment. For example, the response "balloon" to the target drum was assigned stress on the schwa phoneme in the first syllable, as opposed to /u/ in the second syllable, which led to inaccurate assignment of phonological similarity and classification as a Formal paraphasia rather than the Unrelated paraphasia assigned by the human annotator. Although the total number of unique misclassifications of this nature is relatively small (n = 42), automated stress assignment has been flagged as a component to be optimized in future coding efforts.
For semantic feature assignment, human-related error was primarily related to the inappropriate assignment of semantic similarity, or false positives, followed by target–response pairs that possessed an associated relationship. For example, the target snail and response "seal" pair was identified as semantically similar by the human annotator but not by the three raters or ParAlg. An identical pattern, but to a larger extent, was observed for the ParAlg-related errors. For instance, the target microscope and response "graph" pair was identified as semantically similar by ParAlg but not by the three raters or the human annotator. In both examples, there was a distal relation within the superordinate categories of animals and science, respectively, but one that did not reach an internal (human-annotated) or probabilistic (ParAlg) threshold for similarity, which highlights the inherent subjectivity of semantic feature assignment. With regard to ParAlg, it is possible that other word meaning models (e.g., Devlin et al., 2018) may better handle nuanced semantic cases such as these, and efforts to quantify performance across models using ParAlg are currently underway.
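As a rough illustration of the kind of thresholded similarity judgment at issue (not the word embedding model, vectors, or cutoff actually used by ParAlg), the sketch below computes cosine similarity between two word vectors and applies a fixed threshold; distantly related pairs such as microscope–graph are exactly the cases expected to hover near such a cutoff.

```python
# Illustrative sketch of a thresholded embedding-based semantic similarity
# judgment. The vectors and threshold below are placeholders, not the word
# embedding model or cutoff used by ParAlg.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantically_similar(target_vec, response_vec, threshold=0.4):  # hypothetical cutoff
    return cosine_similarity(target_vec, response_vec) >= threshold

# Toy 4-dimensional vectors standing in for embeddings of a target and a response.
target_vec = np.array([0.9, 0.1, 0.3, 0.2])
response_vec = np.array([0.5, 0.4, 0.1, 0.6])
print(semantically_similar(target_vec, response_vec))
```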
Limitations and Future Directions
Although the findings in this study represent a substantive improvement in ParAlg's classification agreement, additional coding efforts are necessary for developing our algorithmic approach into a clinically feasible assessment tool. Specifically, ParAlg remains unable to handle other response categories identified in the PNT scoring manual (e.g., Descriptions) or responses containing two or more words. Optimizing ParAlg to classify these types of responses is of high priority, given the high prevalence of these response types observed in the larger MAPPD data set (n = 2,467 for Descriptions) and the necessity of differentiating so-called “nonnaming” responses from the six paraphasia types outlined here for input into the sp computational model of spoken language production (Foygel & Dell, 2000). Nonetheless, assuming that correct and nonnaming target–response pairs are manually disambiguated from a given PNT administration, preliminary evidence suggests that the high agreement observed here is generalizable to patient-level data and to sp model parameter estimation, the latter of which can be understood to represent quantitative proxies of lexical–semantic and phonological processing during word production (Casilio et al., 2019). To this end, ParAlg may be of particular value in refining sp or similar models, as it yields additional information above and beyond the paraphasia code, such as the number and type of phonological relationships among targets and responses, that could be used to better elucidate the subcomponents that comprise successful or unsuccessful encoding of lexical semantics or phonology.
Additionally, ParAlg development to date has been benchmarked using an archival data set that lacks access to audio recordings. This is particularly disadvantageous with regard to transcription, where assumptions must necessarily be made regarding the intent or purpose of certain conventions or scoring decisions. Despite careful attempts to make minimal changes to the original data, there is a possibility that unintended errors were generated in the data set preparation process of this study. Moreover, the impoverished aspects of the transcriptions, particularly the lack of important phonetic markers (i.e., stress and syllabification markings) in the response phonemic transcriptions, inherently limit the debugging process and future coding efforts. In other words, because certain target–response pairs could not be identified as ParAlg- or human-related errors (e.g., discrepancies related to the stress and syllabification of nonlexical responses), an upper bound is necessarily set on the degree to which ParAlg can be optimized to agree with the human-annotated code criterion.
Finally, this study presumes perfect coding on the part of the human annotator. As noted previously, there are few studies reporting interrater reliability of PNT paraphasia classification, and those that do show variable reliability among paraphasia types. For example, Walker et al. (2022) reported interrater reliability for polytomous scores of eight possible codes from the PNT scoring manual (i.e., the six paraphasia types, in addition to correct and nonnaming codes) using unweighted Cohen's kappa and found reliability to be in the almost perfect range on average (κ = .81) but as low as the moderate range (κ = .52) for certain code types. Although our item-level qualitative analysis provides insights into human misclassification and additionally suggests that human scoring can vary, the degree to which this information is generalizable is unclear, particularly given the reliance of this study on archival data and the moderate agreement among the three raters of this study who reviewed the discrepancies. Efforts are underway to comprehensively quantify this reported variability in a reliability study of PNT paraphasia coding, with the aim of using the resultant data to set an expected range of reliability for human-annotated paraphasia codes and embed that range into metrics for calculating agreement (e.g., F1 score). This attenuation of our criteria with empirical data will not only provide a more meaningful picture of ParAlg's performance but also allow for more specific interpretation of the F1 scores observed in this study, as human-annotated reliability will have been thoroughly evaluated and, thus, can serve as a more informative benchmark. Specifically, it will be useful in setting an appropriate lower bound that ParAlg must necessarily meet or exceed for implementation in clinical practice.
Conclusions
ParAlg, our algorithmic approach to classifying picture naming errors, is a highly accurate and efficient alternative to human-annotated scoring, particularly when orthographic transcriptions are used as input for lexical responses. Although further development is needed before its release as a software application, the prior and current findings reinforce its potential as a useful tool for assessing anomia in both research endeavors and clinical practice.
Data Availability Statement
All data from this study were extracted from the freely available Moss Aphasia Psycholinguistics Project Database (MAPPD), which can be accessed via a login at http://www.mappd.org. The MAPPD-12K data set created for this study, along with a summary file containing an abbreviated version of the ParAlg output file and item-level qualitative analyses, is included in Supplemental Materials S2 and S3, as mentioned above. A copy of the ParAlg software is available upon request.
Acknowledgments
This work was supported by National Institute on Deafness and Other Communication Disorders Grant R01DC015999 (principal investigators: Steven Bedrick and Gerasimos Fergadiotis). The authors also thank the Moss Aphasia Psycholinguistics Project Database (MAPPD) participants who donated their time; Adelyn Brecher for her assistance with interpretation of the Philadelphia Naming Test guidelines; Brooke Cowan for her work in code development; Alex Swiderski for his preliminary extraction of data from the MAPPD; Mikala Fleegle for her helpful comments; and Hattie Olson, Khanh Nguyen, Emily Tudorache, and Mia Cywinski for their efforts in reviewing paraphasia discrepancies.
Appendix
Transcription Preprocessing
Prior to paraphasia classification within ParAlg's multinomial processing tree, multiple preprocessing steps are undertaken, which differ according to the type of transcription (phonemic vs. orthographic).
Phonemic Transcription Preprocessing
The input file (i.e., MAPPD-12K) for this study, as is common in the fields of speech-language pathology and linguistics, contains phonemic transcriptions of participant responses written in the International Phonetic Alphabet (IPA) and encoded in Unicode (see Supplemental Material S1 for details on when and how non-Unicode IPA fonts were handled). These response phonemic transcriptions are first standardized to match conventions specified in ParAlg, a process that also fixes common transcription-related errors (e.g., use of c for the phoneme /k/). Diacritics and extraneous punctuation are additionally stripped during the standardization process.
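A minimal sketch of this standardization step is given below; the substitution table and the set of stripped symbols are placeholders rather than ParAlg's actual conventions.

```python
# Illustrative sketch of standardizing a response IPA transcription: apply
# symbol substitutions and strip diacritics/punctuation. The substitution table
# and stripped-symbol set are placeholders, not ParAlg's actual rules.
import unicodedata

SUBSTITUTIONS = {"c": "k"}  # e.g., Latin "c" used in place of the phoneme /k/
STRIP = {"ˈ", "ˌ", ".", ",", "?", "!"}  # stress marks and extraneous punctuation

def standardize_ipa(transcription: str) -> str:
    # Decompose so that combining diacritics become separable code points.
    decomposed = unicodedata.normalize("NFD", transcription)
    cleaned = []
    for ch in decomposed:
        if unicodedata.combining(ch) or ch in STRIP:
            continue  # drop diacritics and punctuation
        cleaned.append(SUBSTITUTIONS.get(ch, ch))
    return unicodedata.normalize("NFC", "".join(cleaned))

print(standardize_ipa("ˈcæ̃t"))  # -> "kæt"
```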
The corrected and standardized IPA transcriptions of responses then undergo a one-to-one phoneme translation into ARPAbet (Rice, 1976), a phonemically based transcription system commonly used in computer science and natural language processing. As an example, the response cat, written in IPA as /kæt/, would be translated into ARPAbet as K AE T. Then, using an algorithm within ParAlg, phonetic features important to judging phonological similarity among target–response pairs are added to the resultant ARPAbet transcription of the response. Specifically, the transcription is syllabified according to recent work on generative phonotactics (Gorman, 2013), and primary stress is then applied following constraints outlined by Halle (1973). Additionally, a morphological analysis of the response transcription is conducted, and plural suffixes are stripped from the transcription if present, in accordance with the Philadelphia Naming Test's (PNT's) scoring rule that restricts assignment of phonological similarity to the singular form of both the target and the response. Finally, at this stage, rhotic vowels are split into their constituent phonemes in preparation for analysis, as the PNT scoring procedure makes the implicit theoretical assumption that rhotic vowels can be further subdivided into vowel–consonant units. For example, the IPA transcription of the rhotic /ɚ/ would first be translated to ARPAbet ER and then segmented into the ARPAbet equivalents of /ə/ and /ɹ/, which are AH and R.
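The sketch below illustrates these phoneme-level conversions (a one-to-one IPA-to-ARPAbet mapping, plural-suffix stripping, and rhotic splitting); the partial mapping and the simplistic plural rule are placeholders, not ParAlg's implementation.

```python
# Illustrative sketch of the phoneme-level conversions described above: a
# one-to-one IPA-to-ARPAbet mapping, stripping a plural suffix, and splitting
# rhotic vowels. The partial mapping and naive plural rule are placeholders.
IPA_TO_ARPABET = {"k": "K", "æ": "AE", "t": "T", "s": "S", "ɚ": "ER", "ə": "AH", "ɹ": "R"}
RHOTIC_SPLITS = {"ER": ["AH", "R"]}  # rhotic vowels treated as vowel + consonant

def ipa_to_arpabet(ipa_phonemes):
    return [IPA_TO_ARPABET[p] for p in ipa_phonemes]

def strip_plural(arpabet, is_plural=False):
    # Naive stand-in for the morphological analysis: drop a final S/Z when the
    # response has been judged to be a plural form.
    return arpabet[:-1] if is_plural and arpabet[-1] in ("S", "Z") else arpabet

def split_rhotics(arpabet):
    out = []
    for p in arpabet:
        out.extend(RHOTIC_SPLITS.get(p, [p]))
    return out

# "cats" /kæts/ -> K AE T (plural stripped); /ɚ/ -> AH R
print(strip_plural(ipa_to_arpabet(["k", "æ", "t", "s"]), is_plural=True))
print(split_rhotics(ipa_to_arpabet(["ɚ"])))
```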
In cases where the input file is missing a phonemic transcription or the transcription has been purposefully stripped from a given response, ParAlg contains machinery for generating a standardized ARPAbet form via the response orthographic transcription. Specifically, the orthographic form is compared against an ARPAbet pronunciation dictionary (Carnegie Mellon University Pronouncing Dictionary [CMUdict] Version 0.7b) that has been edited to ensure a one-to-one match of all phonemes for the 175 PNT target transcriptions, as specified in the PNT documentation materials (https://mrri.org/philadelphia-naming-test/). When an exact orthographic match is identified within this edited version of CMUdict, henceforth referred to as PSUdict, its corresponding ARPAbet transcription is selected and populated into the input file for subsequent analysis. In cases where multiple ARPAbet transcriptions are specified for a single response orthographic transcription, reflecting the presence of a homograph or dialectal differences across pronunciations, the first ARPAbet transcription listing is selected.
PSUdict is not a fully comprehensive dictionary, and preliminary testing of ParAlg's machinery revealed a small number of lexical responses that had no entry. Rather than appending new entries to PSUdict, ParAlg uses a grapheme-to-phoneme machine learning model, based on the architecture of Gehring et al. (2017) and trained on PSUdict, to generate a predicted ARPAbet transcription from the response orthographic transcription. In this study, the grapheme-to-phoneme model was utilized for 122 out of 5,449 lexical responses in the orthographic-lexical configuration.
Importantly, the aforementioned procedure applies only to preprocessing phonemic transcriptions of responses. Because standardized phonemic transcriptions are fixed for the target items on the PNT and have been integrated into the pronunciation dictionary PSUdict, no target transcription is specified in the input file. Rather, the ARPAbet form is populated directly from PSUdict using the same procedure used for generating a phonemic transcription for responses, albeit with the following caveat: If multiple ARPAbet transcriptions are specified for a single target orthographic transcription, as is the case for 22 of the 175 targets, all transcriptions are selected, given that all are designated as acceptable within the PNT scoring materials and the data set used in this study does not specify which transcription should be considered standard for an individual participant.
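The sketch below illustrates the dictionary-lookup logic described in the preceding paragraphs; the toy dictionary, its structure, and the grapheme-to-phoneme fallback function are placeholders standing in for PSUdict and the trained model.

```python
# Illustrative sketch of orthography-to-ARPAbet lookup against a pronunciation
# dictionary. The toy dictionary and the g2p fallback are placeholders standing
# in for PSUdict and the trained grapheme-to-phoneme model.
PSUDICT = {
    "cat": [["K", "AE", "T"]],
    "tomato": [["T", "AH", "M", "EY", "T", "OW"], ["T", "AH", "M", "AA", "T", "OW"]],
}

def response_arpabet(orthography, g2p_model=None):
    entries = PSUDICT.get(orthography.lower())
    if entries:
        return entries[0]  # homographs/variant pronunciations: take the first listing
    # No dictionary entry: fall back to a (hypothetical) grapheme-to-phoneme model.
    return g2p_model(orthography) if g2p_model else None

def target_arpabet(orthography):
    # Targets with multiple acceptable pronunciations keep all listings.
    return PSUDICT[orthography.lower()]

print(response_arpabet("cat"))   # ['K', 'AE', 'T']
print(target_arpabet("tomato"))  # both acceptable transcriptions retained
```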
Orthographic Transcription Preprocessing
As alluded to in the previous section, orthographic transcriptions of the targets are present in the input file for this study (i.e., MAPPD-12K) and, depending on the transcription configuration, are included for the majority of lexical responses and a small minority of nonlexical responses. No automated procedure for editing or standardizing response orthographic transcriptions is currently implemented in ParAlg's machinery; rather, response orthographic transcriptions from this study were hand-edited prior to study initiation (see Supplemental Material S1 for additional details). Moreover, no preprocessing steps other than the morphological analysis outlined above for phonemic transcriptions (i.e., stripping of plural suffixes) are required; no translation, for example, is necessary. In the case of the response orthographic transcription, this stripping is necessary for quantifying the semantic similarity between the target and the response, as outlined in greater detail in the Method section of this article.
For cases where the input response orthographic transcription is missing or has been purposefully stripped, ParAlg's machinery has the capacity to generate a response orthographic transcription from the standardized ARPAbet transcription using the inverse of the procedure outlined above for phonemic transcription generation. Specifically, the ARPAbet transcription is compared against entries in PSUdict, and when an exact match is identified, the corresponding orthography is selected. When the ARPAbet transcription matches more than one orthographic entry in PSUdict, as is the case for homophones, the first match, as identified via an alphabetic search of the dictionary, is selected. If no one-to-one ARPAbet transcription match is identified, no response orthographic transcription is populated for subsequent analyses.
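A minimal sketch of this inverse lookup is given below; the toy dictionary again stands in for PSUdict, and the alphabetical tie-breaking mirrors the homophone rule described above.

```python
# Illustrative sketch of the inverse lookup: recover an orthographic form from a
# standardized ARPAbet transcription. The toy dictionary is a placeholder for
# PSUdict; homophones are resolved by taking the alphabetically first match.
PSUDICT = {
    "pair": [["P", "EH", "R"]],
    "pear": [["P", "EH", "R"]],
    "cat": [["K", "AE", "T"]],
}

def arpabet_to_orthography(arpabet):
    matches = [
        word
        for word, pronunciations in PSUDICT.items()
        if any(p == arpabet for p in pronunciations)
    ]
    if not matches:
        return None  # no match: leave the orthographic field unpopulated
    return sorted(matches)[0]  # homophones: alphabetically first entry

print(arpabet_to_orthography(["P", "EH", "R"]))      # 'pair' (precedes 'pear')
print(arpabet_to_orthography(["B", "L", "IH", "L"]))  # None (nonword)
```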
Importantly, as is the case with this study's input file and likely the case for many data sets of picture naming responses, only response orthographic transcriptions were missing or purposefully stripped. All target transcriptions were present and retained.
Funding Statement
This work was supported by National Institute on Deafness and Other Communication Disorders Grant R01DC015999 (principal investigators: Steven Bedrick and Gerasimos Fergadiotis).
Footnote
See downloadable resource materials, particularly the PDF file “PNT glossary of terms,” for a detailed overview of scoring guidelines and procedures.
References
- Beeson, P. M., Rising, K., DeMarco, A. T., Foley, T. H., & Rapcsak, S. Z. (2018). The nature and treatment of phonological text agraphia. Neuropsychological Rehabilitation, 28(4), 568–588. https://doi.org/10.1080/09602011.2016.1199387
- Beeson, P. M., Rising, K., Kim, E. S., & Rapcsak, S. Z. (2010). A treatment sequence for phonological alexia/agraphia. Journal of Speech, Language, and Hearing Research, 53(2), 450–468. https://doi.org/10.1044/1092-4388(2009/08-0229)
- Boyle, M., & Coelho, C. A. (1995). Application of semantic feature analysis as a treatment for aphasic dysnomia. American Journal of Speech-Language Pathology, 4(4), 94–98. https://doi.org/10.1044/1058-0360.0404.94
- Brunson, J. C. (2020). ggalluvial: Layered grammar for alluvial plots. Journal of Open Source Software, 5(49), 2017. https://doi.org/10.21105/joss.02017
- Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977
- Casilio, M., Fergadiotis, G., Bedrick, S., & McKinney-Bock, K. (2019). Can machines classify picture naming errors? Evidence from Dell's model [Paper presentation]. Academy of Aphasia 57th Annual Meeting, Macau.
- Chicco, D., & Jurman, G. (2020). The advantages of the Matthews Correlation Coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 1–13. https://doi.org/10.1186/s12864-019-6413-7
- Coelho, C. A., McHugh, R. E., & Boyle, M. (2000). Semantic feature analysis as a treatment for aphasic dysnomia: A replication. Aphasiology, 14(2), 133–142. https://doi.org/10.1080/026870300401513
- Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283–321. https://doi.org/10.1037/0033-295X.93.3.283
- Dell, G. S., Schwartz, M. F., Martin, N., Saffran, E. M., & Gagnon, D. A. (1997). Lexical access in aphasic and nonaphasic speakers. Psychological Review, 104(4), 801–838. https://doi.org/10.1037/0033-295X.104.4.801
- DeMarco, A. T., Wilson, S. M., Rising, K., Rapcsak, S. Z., & Beeson, P. M. (2018). The neural substrates of improved phonological processing following successful treatment in a case of phonological alexia and agraphia. Neurocase, 24(1), 31–40. https://doi.org/10.1080/13554794.2018.1428352
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dudy, S., Asgari, M., & Kain, A. (2015). Pronunciation analysis for children with speech sound disorders [Paper presentation]. 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milano, Italy.
- Fergadiotis, G., Gorman, K., & Bedrick, S. (2016). Algorithmic classification of five characteristic types of paraphasias. American Journal of Speech-Language Pathology, 25(4S), S776–S787. https://doi.org/10.1044/2016_AJSLP-15-0147
- Fergadiotis, G., Hula, W. D., Swiderski, A. M., Lei, C.-M., & Kellough, S. (2019). Enhancing the efficiency of confrontation naming assessment for aphasia using computer adaptive testing. Journal of Speech, Language, and Hearing Research, 62(6), 1724–1738. https://doi.org/10.1044/2018_JSLHR-L-18-0344
- Fergadiotis, G., Kellough, S., & Hula, W. D. (2015). Item response theory modeling of the Philadelphia Naming Test. Journal of Speech, Language, and Hearing Research, 58(3), 865–877. https://doi.org/10.1044/2015_JSLHR-L-14-0249
- Fergadiotis, G., Swiderski, A., & Hula, W. D. (2019). Predicting confrontation naming item difficulty. Aphasiology, 33(6), 689–709. https://doi.org/10.1080/02687038.2018.1495310
- Fletcher, G. S. (2019). Clinical epidemiology: The essentials. Lippincott Williams & Wilkins.
- Foygel, D., & Dell, G. S. (2000). Models of impaired lexical access in speech production. Journal of Memory and Language, 43(2), 182–216. https://doi.org/10.1006/jmla.2000.2716
- Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional sequence to sequence learning [Paper presentation]. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia.
- Goldrick, M. (2016). Integrating SLAM with existing evidence: Comment on Walker and Hickok (2015). Psychonomic Bulletin & Review, 23(2), 648–652. https://doi.org/10.3758/s13423-015-0946-9
- Goodglass, H., & Wingfield, A. (1997). Anomia: Neuroanatomical and cognitive correlates. Academic Press.
- Gorman, K. (2013). Generative phonotactics [Unpublished doctoral dissertation]. University of Pennsylvania.
- Halle, M. (1973). Stress rules in English: A new version. Linguistic Inquiry, 4(4), 451–464.
- Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695. https://doi.org/10.1162/COLI_a_00237
- Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes [Paper presentation]. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia.
- Hula, W. D., Fergadiotis, G., Swiderski, A. M., Silkes, J. P., & Kellough, S. (2020). Empirical evaluation of computer-adaptive alternate short forms for the assessment of anomia severity. Journal of Speech, Language, and Hearing Research, 63(1), 163–172. https://doi.org/10.1044/2019_JSLHR-L-19-0213
- Hula, W. D., Kellough, S., & Fergadiotis, G. (2015). Development and simulation testing of a computerized adaptive version of the Philadelphia Naming Test. Journal of Speech, Language, and Hearing Research, 58(3), 878–890. https://doi.org/10.1044/2015_JSLHR-L-14-0297
- Kendall, D. L., Oelke, M., Brookshire, C. E., & Nadeau, S. E. (2015). The influence of phonomotor treatment on word retrieval abilities in 26 individuals with chronic aphasia: An open trial. Journal of Speech, Language, and Hearing Research, 58(3), 798–812. https://doi.org/10.1044/2015_JSLHR-L-14-0131
- Kendall, D. L., Rosenbek, J. C., Heilman, K. M., Conway, T., Klenberg, K., Rothi, L. J. G., & Nadeau, S. E. (2008). Phoneme-based rehabilitation of anomia in aphasia. Brain and Language, 105(1), 1–17. https://doi.org/10.1016/j.bandl.2007.11.007
- Kertesz, A. (2006). Western Aphasia Battery–Revised (WAB-R). Pro-Ed.
- Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310
- Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2010). Automated grammatical error detection for language learners. Synthesis Lectures on Human Language Technologies, 3(1), 1–134. https://doi.org/10.1007/978-3-031-02153-4
- Leonard, C., Rochon, E., & Laird, L. (2008). Treating naming impairments in aphasia: Findings from a phonological components analysis treatment. Aphasiology, 22(9), 923–947. https://doi.org/10.1080/02687030701831474
- Levelt, W. J., Roelofs, A., & Meyer, A. S. (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38. https://doi.org/10.1017/S0140525X99001776
- Light, R. J. (1971). Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 76(5), 365–377. https://doi.org/10.1037/h0031643
- Lipton, Z. C., Elkan, C., & Naryanaswamy, B. (2014). Optimal thresholding of classifiers to maximize F1 measure [Paper presentation]. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France. https://doi.org/10.1007/978-3-662-44851-9_15
- Manning, C. D., Raghavan, P., & Schutze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071
- McKinney-Bock, K., & Bedrick, S. (2019). Classification of semantic paraphasias: Optimization of a word embedding model [Paper presentation]. Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, Minneapolis, MN, United States. https://doi.org/10.18653/v1/W19-2007
- Meier, E. L., Sheppard, S. M., Goldberg, E. B., Head, C. R., Ubellacker, D. M., Walker, A., & Hillis, A. E. (2020). Naming errors and dysfunctional tissue metrics predict language recovery after acute left hemisphere stroke. Neuropsychologia, 148, 107651. https://doi.org/10.1016/j.neuropsychologia.2020.107651
- Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations [Paper presentation]. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Atlanta, GA, United States.
- Mirman, D., Strauss, T. J., Brecher, A., Walker, G. M., Sobel, P., Dell, G. S., & Schwartz, M. F. (2010). A large, searchable, web-based database of aphasic performance on picture naming and other tests of cognitive function. Cognitive Neuropsychology, 27(6), 495–504. https://doi.org/10.1080/02643294.2011.574112
- Moustroufas, N., & Digalakis, V. (2007). Automatic pronunciation evaluation of foreign speakers using unknown text. Computer Speech & Language, 21(1), 219–230. https://doi.org/10.1016/j.csl.2006.04.001
- Parikh, R., Mathai, A., Parikh, S., Chandra Sekhar, G., & Thomas, R. (2008). Understanding and using sensitivity, specificity and predictive values. Indian Journal of Ophthalmology, 56(1), 45–50. https://doi.org/10.4103/0301-4738.37595
- Prud'hommeaux, E. T., & Roark, B. (2011). Extraction of narrative recall patterns for neuropsychological assessment [Paper presentation]. Twelfth Annual Conference of the International Speech Communication Association, Florence, Italy.
- Rice, L. (1976). Hardware and software for speech synthesis. Dr. Dobb's Journal of Computer Calisthenics & Orthodontia, 1(4), 6–8.
- Roach, A., Schwartz, M. F., Martin, N., Grewal, R. S., & Brecher, A. (1996). The Philadelphia Naming Test: Scoring and rationale. Clinical Aphasiology, 24, 121–133.
- Rosvall, M., & Bergstrom, C. T. (2010). Mapping change in large networks. PLOS ONE, 5(1), Article e8694. https://doi.org/10.1371/journal.pone.0008694
- Salem, A. C., MacFarlane, H., Adams, J. R., Lawley, G. O., Dolata, J. K., Bedrick, S., & Fombonne, E. (2021). Evaluating atypical language in autism using automated language measures. Scientific Reports, 11(1), Article 10968. https://doi.org/10.1038/s41598-021-90304-5
- Schwartz, M. F., & Brecher, A. (2000). A model-driven analysis of severity, response characteristics, and partial recovery in aphasics' picture naming. Brain and Language, 73(1), 62–91. https://doi.org/10.1006/brln.2000.2310
- Schwartz, M. F., Dell, G. S., Martin, N., Gahl, S., & Sobel, P. (2006). A case-series test of the interactive two-step model of lexical access: Evidence from picture naming. Journal of Memory and Language, 54(2), 228–264. https://doi.org/10.1016/j.jml.2005.10.001
- Tharwat, A. (2021). Classification assessment methods. Applied Computing and Informatics, 17(1), 168–192. https://doi.org/10.1016/j.aci.2018.08.003
- van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). Butterworths.
- Walker, G. M., Basilakos, A., Fridriksson, J., & Hickok, G. (2022). Beyond percent correct: Measuring change in individual picture naming ability. Journal of Speech, Language, and Hearing Research, 65(1), 215–237. https://doi.org/10.1044/2021_JSLHR-20-00205
- Walker, G. M., & Hickok, G. (2016a). Bridging computational approaches to speech production: The semantic–lexical–auditory–motor model (SLAM). Psychonomic Bulletin & Review, 23(2), 339–352. https://doi.org/10.3758/s13423-015-0903-7
- Walker, G. M., & Hickok, G. (2016b). Evaluating quantitative and conceptual models of speech production: How does SLAM fare? Psychonomic Bulletin & Review, 23(2), 653–660. https://doi.org/10.3758/s13423-015-0962-9
- Walker, G. M., & Schwartz, M. F. (2012). Short-form Philadelphia Naming Test: Rationale and empirical evaluation. American Journal of Speech-Language Pathology, 21(2), S140–S153. https://doi.org/10.1044/1058-0360(2012/11-0089)
- Wilson, S. M., Yen, M., & Eriksson, D. K. (2018). An adaptive semantic matching paradigm for reliable and valid language mapping in individuals with aphasia. Human Brain Mapping, 39(8), 3285–3307. https://doi.org/10.1002/hbm.24077
- Zhang, Y., Kimberg, D. Y., Coslett, H. B., Schwartz, M. F., & Wang, Z. (2014). Multivariate lesion–symptom mapping using support vector regression. Human Brain Mapping, 35(12), 5861–5876. https://doi.org/10.1002/hbm.22590