Journal of Speech, Language, and Hearing Research (JSLHR). 2025 Jun 13;68(7):3322–3336. doi: 10.1044/2025_JSLHR-25-00003

ABCD: A Simulation Method for Accelerating Conversational Agents With Applications in Aphasia Therapy

Gerald C. Imaezue a, Harikrishna Marampelly b

Abstract

Purpose:

Development of aphasia therapies is limited by clinician shortages, patient recruitment challenges, and funding constraints. To address these barriers, we introduce Agent-Based Conversational Dialogue (ABCD), a novel method for simulating goal-driven natural spoken dialogues between two conversational artificial intelligence (AI) agents—an AI clinician (Re-Agent) and an AI patient (AI-Aphasic), which vocally mimics aphasic errors. Using ABCD, we simulated response elaboration training between both agents with stimuli varying in semantic constraint (high via pictures, low via topics). Rather than resource-intensive fine-tuning, we leveraged prompt engineering with chain-of-thought (CoT) and zero-shot techniques for rapid, cost-effective agent development and piloting.

Method:

Built on OpenAI's GPT-4o as the foundational large language model, Re-Agent and AI-Aphasic were supplemented with external speech-to-text and naturalistic text-to-speech application programming interfaces to create a multiturn, dynamic dialogue system in English. We used it to evaluate Re-Agent's conversational performance across four experimental conditions (CoT + picture, CoT + topic, zero-shot + picture, zero-shot + topic) and aphasic error at two levels: word and discourse errors. Re-Agent's performance was measured using three discourse metrics: global coherence, local coherence, and grammaticality of utterances.

Results:

Overall, Re-Agent performed accurately on all the discourse metrics across all levels of semantic parameter, prompting technique, and aphasic error. The results also indicated that well-crafted zero-shot prompts induce more direct and logically related responses that are robust to adversarial aphasic speech inputs, whereas CoT might lead to responses that slightly lose local coherence due to additional complex reasoning chains.

Conclusions:

ABCD represents a foundational computational approach to accelerate the innovation and preclinical testing of conversational AI partners for speech-language therapy. ABCD circumvents the barriers of collecting diverse errorful speech samples for clinical conversational AI fine-tuning. As AI systems—including large language models and speech technologies—advance rapidly, ABCD will scale accordingly, further enhancing its potential for clinical integration.


Patients with neurological conditions such as stroke often develop aphasia, a language disorder characterized by fragmented and constrained speech outputs, significantly impacting quality of life. Evidence-based therapies, such as response elaboration training (RET), have been shown to enable communicative functions in people with aphasia (PWA) by facilitating the expansion of verbal responses during picture description through a chaining technique (Kearns, 1985). RET involves prompting PWA to elaborate on a picture or video (high-semantic constraint), thereby increasing the length and complexity of their utterances over time (e.g., Ahmadi et al., 2022; Bunker et al., 2019; Wambaugh & Martinez, 2000). However, the development and delivery of evidence-based therapies such as RET are constrained by a shortage of clinician–scientists and speech-language pathologists (SLPs; American Speech-Language-Hearing Association, 2015; Baker et al., 2021), challenges with patient recruitment for clinical studies as evidenced by dominant use of small N designs in intervention studies (Cavanaugh et al., 2022; Murray et al., 2013; Togher et al., 2009), and limited funding for clinical research and health care worldwide (e.g., Petersen, 2021).

Advances in artificial intelligence (AI) offer a promising solution to these limitations. However, AI-driven clinical care also poses risks such as inaccurate or biased outputs, reduced human oversight, and potential loss of human connection in therapy (Botha et al., 2024; Karimian et al., 2022; Pias et al., 2025). Ensuring responsible implementation requires rigorous development and validation with continued human clinician involvement. The development and responsible implementation of conversational AI agents for aphasia care can revolutionize aphasia therapy by providing scalable, personalized treatment options. A conversational AI agent is an AI system designed to simulate and facilitate natural humanlike interactions through text or speech with humans (Khatri et al., 2018). These agents leverage natural language processing, machine learning, and sometimes speech recognition and synthesis to interpret user inputs, generate relevant and coherent responses, and perform specific tasks (Mariani et al., 2023). We created a speech-based conversational AI agent, termed the Response Elaboration AI Agent (Re-Agent), which integrates a large language model (LLM) with speech-to-text and text-to-speech capabilities. This integration enables the Re-Agent to conduct interactive sessions that closely mimic human-led RET sessions. However, the errorful speech of PWA may act as adversarial prompts to these systems. In natural language processing, adversarial prompts are intentionally crafted prompts that cause LLMs to produce unexpected outputs (e.g., Das et al., 2024; Kumar et al., 2023). Here, the idea is that if an LLM receives input that mimics the disordered patterns of aphasic speech, it might be tricked or confused into generating contextually inappropriate responses, much as it would be by a deliberately designed adversarial prompt. For example, consider Re-Agent administering RET using a picture depicting “a boy kicking a ball.” If a PWA with phonemic paraphasia (phonologically related real-word deviations) responds with boy uhm kick /f/all, it is crucial that Re-Agent does not erroneously elaborate on the errorful utterance, such as asking, “Why did the boy fall?” Instead, Re-Agent must recognize and correct the error to maintain the integrity of therapeutic communication. Therefore, Re-Agent must recognize and adapt to a variety of speech anomalies typical of aphasia, ensuring that the therapy remains on track and clinically relevant. This solution ideally requires fine-tuning and testing Re-Agent with a substantial number of PWA who exhibit a spectrum of aphasic speech errors. However, this may be cumbersome, time-consuming, expensive, and resource intensive due to the logistical challenges with recruitment of PWA who manifest a wide spectrum of aphasic speech errors.

To address this gap, we introduce Agent-Based Conversational Dialogue (ABCD), a novel method for simulating goal-driven natural spoken dialogues between conversational AI agents in a particular context. ABCD is useful for modeling and developing conversational AI agents navigating communication challenges, like disordered speech, and evaluating their behaviors in simulated natural spoken dialogue settings. We use ABCD to simulate RET sessions between Re-Agent and an auxiliary AI agent (AI-Aphasic) that manifests a spectrum of aphasic speech errors. We developed the AI-Aphasic to vocally imitate some aphasic speech errors when conversing with the Re-Agent for development, training, and piloting purposes. Agent-based modeling involves simulating the behavior of individual AI agents and their interactions within a defined environment (e.g., Gao et al., 2024). ABCD combines agent-based modeling with speech-based conversational AI to simulate complex human spoken dialogues in a particular setting. In this study, ABCD expedites the development of Re-Agent and the systematic evaluation of its capabilities across a spectrum of aphasic word errors and discourse errors. Word errors include phonemic errors (phonologically related real-word deviations), semantic errors (semantically related real-word deviations), perseveration errors (repeated errorful responses), and structural (syntactic) errors (issues with word order and grammatical structures). Discourse errors encompass tangential speech (responses unrelated to the stimuli) and pathological language mixing in bilingual aphasia (involuntary switching between English and Spanish in an utterance).

Rather than relying on resource-intensive fine-tuning of the LLM, we employed prompt engineering techniques such as chain-of-thought (CoT) and zero-shot learning to explore the flexibility and adaptability of the Re-Agent across stimuli with varying semantic constraints—pictures (high semantic constraint) and discussion topics (low semantic constraint). Our approach is justified by the need to evaluate the optimal prompting technique for Re-Agent across a range of therapeutic scenarios. CoT prompting, a state-of-the-art approach, involves guiding LLMs to use step-by-step reasoning for implementing a given task (Wei et al., 2022). In this context, CoT helps Re-Agent to mimic the RET guidelines when implementing the procedure. This method is particularly valuable in therapy as it ensures that the conversational AI agent implements a behavioral intervention in line with its evidence-based protocol. Zero-shot learning, on the other hand, allows the AI to apply knowledge from one situation to a novel situation without providing explicit examples to guide its behavior (e.g., Kojima et al., 2022). This capability, which depends on the LLM's pretrained knowledge, is crucial for adapting to the unique and varied linguistic errors typical in aphasia (Cong et al., 2024). Our central research questions are:

  1. How do the four conditions (CoT + Pictures, CoT + Topics, Zero-shot + Pictures, Zero-shot + Topics) impact the Re-Agent's performance?

  2. Are certain aphasic error types more challenging for Re-Agent across all four conditions?

Our study uses ABCD to evaluate Re-Agent's performance through three discourse metrics in clinical aphasiology: global coherence, local coherence, and grammaticality of utterances (Glosser & Deser, 1991; Van Leer & Turkstra, 1999). Although there is a multitude of evaluation metrics for conversational AI in health care (Abbasian et al., 2024), the selected metrics are essential as a first step for evaluating Re-Agent's capacity to participate in clinical conversations, ensuring the use of grammatically correct, semantically appropriate utterances that maintain a coherent and logical flow, in line with the RET protocol, in a given context.

ABCD is a novel method for developing, prototyping, and piloting scalable conversational AI partners, such as those for speech-language therapy, in simulated natural spoken dialogue settings prior to human testing. This method expedites the development of conversational AI agents that could supplement human-driven treatments and extend essential aphasia care to underserved populations.

Method

Experimental Design

We employed a 2 × 2 × 2 repeated-measures factorial design to systematically investigate the effects of prompting techniques, semantic parameters, and aphasic error types on Re-Agent's conversational performance (see Figure 1). The design includes between-conditions factors for prompting techniques and semantic parameters, while aphasic error type serves as a within-conditions factor (see Figure 1). This structured approach enables a detailed analysis of how varying prompts and contextual factors influence the effectiveness of therapy simulations conducted by conversational AI agents specifically tailored for RET scenarios.

Figure 1.

A 2 by 2 matrix. The columns represent semantic parameters which are High constrained (Picture), and Low constrained (Topic). The rows represent Prompting techniques which are chain of thought and zero shot. The cell entries are as follows. Cell 1. [Chain of thought, Picture], word errors, discourse errors. Cell 2. [Chain of thought, Topic], word errors, discourse errors. Cell 3. [Zero shot, Picture], word errors, discourse errors. Cell 4. [Zero shot, Topic], word errors, discourse errors.

A 2 × 2 × 2 factorial matrix of prompting technique, semantic parameter, and aphasic error type at two levels each.

Each of the four conditions—CoT with pictures, CoT with topics, zero-shot with pictures, and zero-shot with topics—includes both levels of the aphasic error factor: word errors (phonemic, semantic, perseveration, and structural errors) and discourse errors (tangential speech and language mixing). Re-Agent and AI-Aphasic (the simulated patient) used 30 picture captions or topics as contexts to guide their conversations per factorial condition. The picture captions described simple subject–verb–object–complement relations, and the topics covered common and relatable subjects and events (e.g., favorite sports). Our setup enabled 1,375 conversational turns across all conditions, allowing for deep interactions and sufficient data collection to analyze Re-Agent's potential therapeutic efficacy. The study was not subject to an approval process.
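For readers who want to lay the design out computationally, the following minimal sketch enumerates the factorial conditions and error subtypes described above; the variable names and the condition data structure are our own illustrative assumptions, not the authors' code.

```python
# Illustrative enumeration of the factorial design; names are assumptions for illustration only.
from itertools import product

PROMPTING_TECHNIQUES = ["chain_of_thought", "zero_shot"]
SEMANTIC_PARAMETERS = ["picture", "topic"]          # high vs. low semantic constraint
ERROR_SUBTYPES = {
    "word": ["phonemic", "semantic", "perseveration", "structural"],
    "discourse": ["tangential_speech", "language_mixing"],
}

conditions = []
for prompting, semantic, error_level in product(PROMPTING_TECHNIQUES, SEMANTIC_PARAMETERS,
                                                ERROR_SUBTYPES):
    for subtype in ERROR_SUBTYPES[error_level]:
        # Each condition uses 30 picture captions or topics as shared conversational contexts.
        conditions.append({"prompting": prompting, "semantic": semantic,
                           "error_level": error_level, "error_subtype": subtype})

print(len(conditions))  # 2 x 2 x (4 + 2) = 24 condition-by-subtype cells in this sketch
```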

Development of Conversational AI Agents: Re-Agent and AI-Aphasic

Re-Agent and AI-Aphasic were developed using OpenAI's GPT-4o as their base LLM, supplemented with application programming interfaces for external speech-to-text and naturalistic text-to-speech models in varying voices of English speakers (see Figure 2). The LLM, the core of our technologies, was instructed using prompt engineering. We opted for prompt engineering, instead of traditional LLM fine-tuning, due to several compelling scientific and practical reasons that align well with the needs and objectives of our study. Firstly, from a resource management perspective, prompting is significantly more efficient compared to the traditional fine-tuning of LLMs. Fine-tuning requires large amounts of training data (aphasic speech samples), substantial computational resources, and considerable time to implement changes effectively. In contrast, prompt engineering leverages the existing capabilities of advanced, state-of-the-art LLMs without additional resource-intensive training, making it a cost-effective solution that enhances the scalability and rapid deployment of our therapeutic AI tools. Furthermore, prompting simplifies the implementation and updating process of LLMs. In the rapidly evolving field of clinical therapy, treatment strategies frequently need refinement based on the latest research or feedback from practice. With prompting, adjustments to the AI's therapeutic behavior can be made by simply modifying the prompts or the structure of the prompts, rather than undergoing the cumbersome process of retraining the model. This not only saves time but also enhances the ability to iterate, optimize, and scale Re-Agent and related AI-driven therapeutic approaches quickly in response to emerging needs.

Figure 2.

A block diagram for multiturn conversation between two agents: Agent 1 (Re-Agent) and Agent 2 (AI-Aphasic). Each agent uses an LLM. The inputs to the first agent's LLM are speech-to-text data and a command from a prompt template. The output of the first agent's LLM is text-to-speech data. The first agent's text-to-speech data are converted to speech-to-text input for the second agent's LLM. The second agent's LLM also receives a command input from a prompt template. The output of the second agent's LLM is text-to-speech data, which is then converted back to text and used as the input for the first agent's LLM. Both agents converse about a picture or a topic.

Schematic of Agent-Based Conversational Dialogue for simulating natural spoken dialogues between Re-Agent and AI-Aphasic. LLM = large language model.
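To make the pipeline in Figure 2 concrete, the sketch below shows one way a single agent's spoken turn could be wired together with the OpenAI Python SDK (GPT-4o plus external speech-to-text and text-to-speech endpoints). The model names, voice settings, and helper functions are illustrative assumptions; they are not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): one spoken turn for a single agent, assuming the
# OpenAI Python SDK with GPT-4o plus external speech-to-text and text-to-speech models.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def transcribe(audio_path: str) -> str:
    """Speech-to-text: convert the other agent's spoken output into text."""
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text


def respond(system_prompt: str, history: list[dict], user_text: str) -> str:
    """LLM turn: the agent's prompt template plus the running dialogue history."""
    messages = [{"role": "system", "content": system_prompt},
                *history,
                {"role": "user", "content": user_text}]
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    return reply.choices[0].message.content


def speak(text: str, voice: str, out_path: str) -> str:
    """Text-to-speech: render the agent's reply in a naturalistic English voice."""
    audio = client.audio.speech.create(model="tts-1", voice=voice, input=text)
    with open(out_path, "wb") as f:
        f.write(audio.read())
    return out_path
```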

Prompting Templates for Re-Agent

CoT Prompting

This is a prompting technique used in LLMs that encourages step-by-step reasoning to improve problem-solving, particularly in tasks requiring logical inference, arithmetic calculations, or multistep reasoning (Wei et al., 2022). In this study, we created and employed a specialized CoT prompting template designed with explicit reasoning guidelines and examples to guide and structure Re-Agent's therapeutic interactions with AI-Aphasic in line with the RET protocol. The CoT template integrates a series of structured prompts that simulate a step-by-step reasoning process underpinning the implementation of RET. This approach assists Re-Agent in generating responses that not only adhere to the therapeutic goals but also encourage patients to expand their utterances coherently and contextually. The template was crafted to elicit detailed and extended dialogue from patients by gradually building complexity and depth in the conversation topics, which aligns with the principles of RET. While the specifics of the template are proprietary, its use demonstrated Re-Agent's ability to conduct RET sessions.

Zero-Shot Prompting

This is a technique in which an LLM is given a task without any prior examples, making the model rely entirely on its pretrained knowledge to generate a response (e.g., Xian et al., 2020). We used this technique to enhance Re-Agent's ability to interact effectively with the AI-Aphasic in a diverse range of conversational scenarios without explicit reasoning steps or examples on those specific tasks. This method supports our technology in applying broad, pretrained knowledge to novel situations, aiding in the generation of relevant and contextually appropriate responses during simulated therapy sessions. Although the detailed contents of the zero-shot template remain confidential due to proprietary reasons, its use in this study enabled Re-Agent to adapt dynamically to a variety of linguistic challenges presented by AI-Aphasic, without any RET guideline or example. The general overview of both prompting techniques is sufficient to guide replication efforts and future ABCD applications for simulating other multi-agent natural spoken dialogues.
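Because the study's templates are proprietary, the snippet below gives only a hypothetical flavor of how a zero-shot and a CoT system prompt for Re-Agent might differ in structure; the wording is ours and is not drawn from the actual templates.

```python
# Hypothetical prompt skeletons for illustration only; the study's actual templates are proprietary.
ZERO_SHOT_PROMPT = (
    "You are a speech-language clinician delivering Response Elaboration Training (RET). "
    "Use the shared context (picture caption or topic): {context}. "
    "Encourage the patient to elaborate their utterances, gently correct errorful speech, "
    "and keep the conversation focused on the target context."
)

COT_PROMPT = (
    "You are a speech-language clinician delivering Response Elaboration Training (RET). "
    "Before each reply, reason step by step: "
    "(1) check the patient's utterance against the context: {context}; "
    "(2) identify any errorful or off-topic content; "
    "(3) model a corrected, expanded version of the patient's idea; "
    "(4) ask one question that invites further elaboration. "
    "Then output only the final clinician utterance."
)
```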

Prompt Engineering of AI-Aphasic Agent

To enhance the Re-Agent's therapeutic capabilities, we assessed its conversational performance under two categories of patholinguistic errors: word-level and discourse-level. Word-level errors included phonemic errors (phonologically related but real-word deviations), semantic errors (meaning-related real-word substitutions), perseveration errors (repetitive, error-prone responses), and structural (syntactic) errors (anomalies in word order and grammar). Discourse-level errors involved tangential speech (responses unrelated to the prompt) and pathological language mixing (involuntary shifts between English and Spanish within an utterance).

The AI-Aphasic was programmed with a few-shot prompt template—a prompt that includes a few examples of the target aphasic errors—designed to generate word and discourse errors, with adjustable settings to focus on specific error types during simulated RET sessions (see Appendix). We prompted the AI-Aphasic model to produce these errors—one error subtype per experimental condition—imitating a patient with moderate nonfluent aphasia. Each level of the within-conditions factor (aphasic error type) targeted a specific error subtype. Our goal was to measure Re-Agent's conversational performance when confronted with these patholinguistic errors.
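A hypothetical sketch of how one error subtype per condition might be injected into AI-Aphasic's template is shown below; the template text here is an abridged paraphrase of the Appendix, and the helper names are our assumptions rather than the authors' code.

```python
# Hypothetical configuration of AI-Aphasic's prompt per condition; an abridged paraphrase of the
# Appendix template, not the exact prompt used in the study.
AI_APHASIC_TEMPLATE = (
    "You must respond like a person with moderate nonfluent aphasia engaged in response "
    "elaboration training about: {context}. Start with a minimal, errorful phrase and expand "
    "your responses over time. Exhibit only this class of errors: {error_instruction}."
)

ERROR_INSTRUCTIONS = {
    "phonemic": "phonologically related real-word substitutions (e.g., 'cat' -> 'cap')",
    "semantic": "semantically related real-word substitutions (e.g., 'table' -> 'chair')",
    "perseveration": "repetition of earlier responses even when no longer appropriate",
    "structural": "impaired word order and omitted grammatical elements",
    "tangential_speech": "off-topic utterances unrelated to the target context",
    "language_mixing": "involuntary switching between English and Spanish within an utterance",
}


def build_aphasic_prompt(context: str, error_subtype: str) -> str:
    """Return the AI-Aphasic system prompt for one experimental condition."""
    return AI_APHASIC_TEMPLATE.format(context=context,
                                      error_instruction=ERROR_INSTRUCTIONS[error_subtype])
```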

The AI-simulated aphasic errors were reviewed and validated by author GCI, an expert in aphasia who is familiar with the target errors. Additionally, two trained SLPs, independent of the study, conducted further validations.

ABCD Simulation of RET Sessions

The core treatment modality used in this study is the traditional RET protocol (Kearns, 1985), which is adapted for automated delivery by the Re-Agent (AI-SLP). RET, in this context, is designed to encourage AI-Aphasic to elaborate its responses, using a chaining technique, promoting richer and more complex language use over time.

We used the two prompts to guide Re-Agent's conversation with the AI-Aphasic across the two tasks—pictures and topics. Both agents shared the same contextual space (e.g., picture or topic) to ensure that their conversations were related to a particular stimulus or task (see Figure 2). In the case of pictures, we used ones that depict simple subject–verb–object–complement relationships. We then created captions of the pictures and included them in the prompt templates of both AI agents. As for topics, we used familiar topics such as sports. We instructed the AI-Aphasic to discuss these topics by choosing a specific theme—for example, sports: basketball—and make specific errors in line with the theme. For instance, during tangential speech, the AI-Aphasic can produce utterances about topics other than sports and basketball, while the Re-Agent is expected to use the topic of sports to redirect the conversation back to the target task (see Table 1).

Table 1.

Dialogue between Re-Agent and AI-Aphasic on the topic of “favorite sport.”

AI Agent Dialogue
Re-Agent Hi! Let's start our conversation by talking about your favorite sport. Can you tell me what your favorite sport is?
AI-Aphasic Play ball.
Re-Agent Great! You like playing ball. Can you tell me which ball sport you enjoy playing the most?
AI-Aphasic Basketball. I like the my hello garden is looking really green this time of year.
Re-Agent That's wonderful to know you enjoy playing basketball. Let's stay on the topic of basketball for now. What do you like most about playing basketball?
AI-Aphasic Yep.
Re-Agent Please try to respond with a phrase or sentence. Can you tell me what you enjoy most about playing basketball?
AI-Aphasic Shooting hoop. New shoes. Dribbling fun. Seasoning steak makes it tasty.
Re-Agent You like shooting hoops and dribbling, and it sounds like you also got new shoes for playing basketball. That's great! Let's focus on basketball. How often do you play basketball in a week?
AI-Aphasic 3 times a week.

Note. AI-Aphasic is prompted to demonstrate tangential speech during the simulated therapy session.

Given that both AI agents had speech capabilities, we simulated spoken conversations in English, using varying voices, between a therapist implementing RET and a patient who produced distinct types of aphasic errors. In line with ABCD, Re-Agent initiated the simulated session, and its speech output was fed as input to the AI-Aphasic, whose output was fed back to Re-Agent as its input (see Figure 2). They engaged in this multiturn, dynamic natural spoken dialogue for each factorial block. This method simulated and expedited the development and evaluation of Re-Agent's performance when interacting with simulated aphasic speech.
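A minimal sketch of the ABCD turn-taking loop described here is given below, reusing the hypothetical transcribe/respond/speak helpers from the earlier snippet; the turn count, voices, and file names are illustrative assumptions rather than the authors' settings.

```python
# Illustrative ABCD loop: Re-Agent opens the session, and each agent's synthesized speech is
# transcribed and fed to the other agent (assumptions: helpers from the earlier sketch and
# per-agent system prompts containing a shared {context} placeholder).
def abcd_session(re_agent_prompt: str, aphasic_prompt: str, context: str, max_turns: int = 10):
    re_history, pat_history, transcript = [], [], []
    clinician_in = "Begin the response elaboration training session."  # seed for the opening turn
    for _ in range(max_turns):
        # Re-Agent produces its clinician utterance and speaks it aloud.
        clinician_text = respond(re_agent_prompt.format(context=context), re_history, clinician_in)
        re_history += [{"role": "user", "content": clinician_in},
                       {"role": "assistant", "content": clinician_text}]
        speak(clinician_text, voice="alloy", out_path="clinician.mp3")

        # AI-Aphasic hears the clinician and replies with simulated errorful speech.
        patient_in = transcribe("clinician.mp3")
        patient_text = respond(aphasic_prompt.format(context=context), pat_history, patient_in)
        pat_history += [{"role": "user", "content": patient_in},
                        {"role": "assistant", "content": patient_text}]
        speak(patient_text, voice="echo", out_path="patient.mp3")

        # The patient's spoken output is transcribed back as Re-Agent's next input.
        clinician_in = transcribe("patient.mp3")
        transcript.append({"Re-Agent": clinician_text, "AI-Aphasic": patient_text})
    return transcript
```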

Data Collection and Scoring

Data collection in this study involved capturing and saving the interactions between the Re-Agent and the AI-Aphasic across different conditions per factorial block for detailed analysis. We used automatic scoring of conversational transcripts for global coherence, local coherence, and grammaticality (see Appendix). Local coherence evaluates how well each utterance of Re-Agent's conversation logically and meaningfully follows from the preceding ones in line with RET. This metric ensures that Re-Agent's conversation progresses smoothly and logically, without abrupt shifts or inconsistencies due to AI-Aphasic's errorful utterances. We adapted and used the following 5-point Likert scale to measure Re-Agent's local coherence (Glosser & Deser, 1991; Van Leer & Turkstra, 1999): 1—radical topic shift, unintelligible utterance, or comment on the discourse; 2—moderate topic shift or occasional lack of clarity in referencing a prior utterance; 3—slight topic shift or referentially vague; 4—mostly coherent with minor inconsistencies or ambiguities in relation to the preceding utterance(s); and 5—relation through continuation, elaboration, repetition, subordination, or coordination of ideas from the preceding utterance(s).

Global coherence evaluates the overall thematic and contextual consistency of each utterance produced by the Re-Agent. It examines whether Re-Agent's conversations remain focused on the intended topic or task across multiple turns. This metric ensures that the Re-Agent maintains a coherent, relevant, and purposeful interaction across the entire conversation, sticking to the intended themes and goals regardless of AI-Aphasic adversarial errors. We adapted and used the following 5-point Likert scale to measure Re-Agent's global coherence (Glosser & Deser, 1991; Van Leer & Turkstra, 1999): 1 = no relation at all to the topic, unintelligible, or comment on the discourse; 2 = weak or tangential relation to the topic; 3 = possible relation to the topic; 4 = clear relation to the topic but may lack full relevance or specificity; and 5 = concrete information directly and strongly related to the topic.

Grammaticality measures mistakes in verb tense, subject–verb agreement, pronoun consistency, preposition usage, and sentence structure that may affect the intelligibility of an utterance. Each of the Re-Agent's responses receives a score of 1–5 for Grammaticality using this scale: 1 = major errors that render the response largely unintelligible or completely irrelevant; 2 = significant errors that make understanding difficult; 3 = moderate errors that may confuse the user, but the overall meaning is discernible; 4 = minor errors that do not impede understanding; and 5 = no grammatical or semantic errors.
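As a rough illustration of the automatic scoring step (the full rubric appears in the Appendix), a transcript could be passed to GPT-4o together with the evaluation prompt as sketched below; the function name and output handling are assumptions, not the authors' pipeline.

```python
# Illustrative automatic-scoring call; an assumption about the pipeline, using the evaluation
# rubric reproduced in the Appendix as the system prompt.
def score_transcript(rubric_prompt: str, transcript: list[dict]) -> str:
    dialogue = "\n".join(f"Re-Agent: {t['Re-Agent']}\nAI-Aphasic: {t['AI-Aphasic']}"
                         for t in transcript)
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": rubric_prompt},
                  {"role": "user", "content": dialogue}],
    )
    # The rubric asks for a table of per-utterance scores (1-5) with brief justifications.
    return result.choices[0].message.content
```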

After automatic scoring of the transcripts, we randomly extracted 10% of the data across all conditions for human scoring by the two clinically certified SLPs: one an English speaker and the other a native speaker of both Spanish and English. The SLPs were not involved in the study, and they were blinded to the conditions. The percentage agreement among the three raters (the automated scorer and the two SLPs) for the performance metrics was approximately 99.67%. This high value suggests a strong consensus between the automated and human raters across the evaluated dimensions.

Statistical Analyses

All statistical analyses were conducted to evaluate Re-Agent's conversational performance across four experimental conditions. The primary objective was to determine whether the AI-Agent's performance varied significantly across these conditions. To begin, a descriptive analysis was performed to summarize the data set. Measures of central tendency (M) and dispersion (SD) were calculated for the three key performance metrics: grammaticality, local coherence, and global coherence. These descriptive statistics provided an initial overview of the AI-Agent's performance trends across the different variables and their interactions, laying the groundwork for subsequent statistical tests. To address the ceiling effect in the response variable, we transformed the local coherence metric into a binary outcome. Scores of 5 were categorized as 1 (indicating high local coherence), while scores of 4 or lower were assigned a value of 0 (representing lower local coherence). Each factor was coded as a categorical variable: prompting technique (CoT = 1, zero-shot = −1), semantic parameter (picture = 1, topic = −1), and aphasic error type (word-level error = 1, discourse-level error = −1). To investigate the effects of each factor and their interactions on Re-Agent's local coherence, we employed binary logistic regression using the glm() function in RStudio (Version 2024.04.2+764) with a binomial family (logit link). Wald's z tests (Wald, 1943) were used to evaluate model significance, and odds ratios with 95% confidence intervals were calculated to interpret effect sizes. Further post hoc analysis was conducted using pairwise comparisons of estimated marginal means for interaction effects, with p values Bonferroni-adjusted for multiple comparisons. We used the emmeans package in RStudio (Version 2024.04.2+764) to perform this analysis.
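The model itself was fit in R with glm() and followed up with emmeans; purely for illustration, a roughly equivalent specification in Python using statsmodels might look like the sketch below, where the column names and input file are assumptions about how the scored transcripts could be organized.

```python
# Roughly equivalent logistic-regression specification in Python (statsmodels); the study itself
# used R's glm() and the emmeans package. Column names and the input file are assumptions:
# 'local_coherence' holds the 1-5 scores, and the three factors are effect-coded as +1/-1.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("re_agent_scores.csv")                        # hypothetical scored-transcript data
df["coherent"] = (df["local_coherence"] == 5).astype(int)      # 5 -> 1, 4 or lower -> 0

model = smf.glm("coherent ~ prompting * semantic * error",     # full factorial with interactions
                data=df, family=sm.families.Binomial()).fit()
print(model.summary())                                         # Wald z tests per coefficient
print(np.exp(model.params))                                    # odds ratios
```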

Results

Statistical Summary of Re-Agent Performance per Condition

The analysis examined global coherence, local coherence, and grammaticality across four experimental conditions: CoT + picture, CoT + topic, zero-shot + picture, and zero-shot + topic. Both CoT and zero-shot approaches performed exceptionally well across all measures.

Figure 3 shows that for global coherence, all conditions demonstrated consistently high scores. The zero-shot + picture condition achieved a perfect M of 5.0 (SD = 0.0), indicating no variability. The CoT + topic condition followed closely (M = 4.997, SD = 0.051), with strong coherence and minimal variation. Zero-shot + topic (M = 4.988, SD = 0.070) and CoT + picture (M = 4.971, SD = 0.082) also performed exceptionally well, suggesting that both CoT and zero-shot approaches maintained high coherence, regardless of whether paired with pictures or topics.

Figure 3.

A bar graph for the mean score by conditions and outcome measures. The conditions are CoT and picture, CoT and topic, zero-shot and picture, and zero-shot and topic. The outcome measures are global coherence, local coherence, and grammaticality. The data are as follows. 1. CoT and picture. Global coherence: 4.97. Local coherence: 4.86. Grammaticality: 4.97. 2. CoT and topic. Global coherence: 4.99. Local coherence: 4.82. Grammaticality: 5. 3. Zero-shot and picture. Global coherence: 5. Local coherence: 4.85. Grammaticality: 5. 4. Zero-shot and topic. Global coherence: 4.98. Local coherence: 4.93. Grammaticality: 5.

Statistical summary of Re-Agent global coherence, local coherence, and grammaticality mean scores per condition. CoT = chain-of-thought.

Figure 3 reveals that for local coherence, slight differences emerged. Zero-shot + topic showed the highest mean score (M = 4.925, SD = 0.080), followed by CoT + picture (M = 4.867, SD = 0.115), zero-shot + picture (M = 4.848, SD = 0.101), and CoT + topic (M = 4.820, SD = 0.128), which exhibited the most variability. Despite minor differences, all conditions maintained high local coherence.

Figure 3 shows that for grammaticality, results were highly consistent. CoT + topic, zero-shot + picture, and zero-shot + topic all achieved a perfect M score of 5.0 (SD = 0.0), indicating flawless grammatical accuracy. The CoT + picture condition, while slightly lower, still performed exceptionally well (M = 4.974, SD = 0.082).

Statistical Summary of Re-Agent Performance per Error and Condition

The analysis of mean and standard deviation values across grammaticality, local coherence, and global coherence shows consistently high Re-Agent performance across all experimental conditions (CoT + pictures, CoT + topics, zero-shot + pictures, zero-shot + topics) and aphasic error types (perseveration, phonemic, semantic, structural, language mixing, tangential speech).

For grammaticality, near-perfect mean scores were observed across most conditions and error types (see Figure 4). In the CoT + Pictures condition, language mixing, phonemic, semantic, and structural errors all scored 5.0 (SD = 0.0), indicating absolute consistency, while tangential speech and perseveration error showed slight variability (M = 4.98, SD = 0.13). A similar trend appeared in the zero-shot + pictures condition, where phonemic, semantic, and structural errors remained at 5.0 (SD = 0.0), while perseveration error dropped slightly (M = 4.95, SD = 0.15). The CoT + topics and zero-shot + topics conditions followed these patterns, with minor dips in perseveration error scores under topic-based prompts.

Figure 4.

Three bar graphs show variations in three scores by aphasic errors. The errors are language mixing, perseveration error, phonemic error, semantic error, structural error, and tangential speech. The conditions are CoT and pictures, CoT and topic, zero-shot and picture, and zero-shot and topic. The first graph is for the mean grammaticality score. The data are as follows. 1. Language mixing. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 2. Perseveration error. CoT and pictures: 4.98. CoT and topic: 4.9. Zero-shot and picture: 4.95. Zero-shot and topic: 5. 3. Phonemic error. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 4. Semantic error. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 5. Structural error. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 6. Tangential speech. CoT and pictures: 4.8. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. The second graph is for the mean local coherence score. 1. Language mixing. CoT and pictures: 4.96. CoT and topic: 4.8. Zero-shot and picture: 4.8. Zero-shot and topic: 4.9. 2. Perseveration error. CoT and pictures: 4.8. CoT and topic: 4.8. Zero-shot and picture: 4.8. Zero-shot and topic: 4.9. 3. Phonemic error. CoT and pictures: 4.83. CoT and topic: 4.8. Zero-shot and picture: 4.9. Zero-shot and topic: 5. 4. Semantic error. CoT and pictures: 4.83. CoT and topic: 4.85. Zero-shot and picture: 4.8. Zero-shot and topic: 4.95. 5. Structural error. CoT and pictures: 5. CoT and topic: 4.8. Zero-shot and picture: 4.85. Zero-shot and topic: 4.9. 6. Tangential speech. CoT and pictures: 4.8. CoT and topic: 4.85. Zero-shot and picture: 4.9. Zero-shot and topic: 4.9. The third graph is for the mean global coherence score. 1. Language mixing. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 4.9. 2. Perseveration error. CoT and pictures: 4.9. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 3. Phonemic error. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 4. Semantic error. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 5. Structural error. CoT and pictures: 5. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 5. 6. Tangential speech. CoT and pictures: 4.9. CoT and topic: 5. Zero-shot and picture: 5. Zero-shot and topic: 4.9.

Statistical summary of Re-Agent performance per aphasia error type. CoT = chain-of-thought.

For local coherence, high mean scores were maintained, though with slightly greater variability than grammaticality (see Figure 4). In the CoT + pictures condition, language mixing achieved M = 4.96 (SD = 0.20), while phonemic (M = 4.83, SD = 0.38) and semantic errors (M = 4.83, SD = 0.37) showed minor inconsistency. In the zero-shot + pictures condition, language mixing and structural errors remained stable (M = 4.95, SD = 0.0), but perseveration error showed greater variability (M = 4.85, SD = 0.44). The CoT + topics and zero-shot + topics conditions followed similar trends, with the greatest variability in perseveration and phonemic errors.

For global coherence, performance was stable across all conditions and error types (see Figure 4), with most M scores at 5.0 (SD = 0.0). In the CoT + pictures condition, language mixing, phonemic, semantic, and structural errors exhibited no variability, while tangential speech and perseveration error showed minor inconsistency (M = 4.9, SD = 0.4). Similar patterns appeared in the zero-shot + pictures condition, where phonemic, language mixing, and structural errors remained at 5.0 (SD = 0.0), while perseveration error dropped slightly (M = 4.88, SD = 0.42). The CoT + topics and zero-shot + topics conditions mirrored these trends, with minor reductions in perseveration and phonemic errors.

Statistical Analysis of the Effects of Factors on Local Coherence

The statistical summary shows that Re-Agent's grammaticality and global coherence mean values were at ceiling across all conditions, but not local coherence (see Figure 3). As a result, we only investigated the effects of each factor and their interactions on Re-Agent's local coherence. The results, as shown in Table 2, indicate that the coefficient for prompting technique is negative, which means that increasing the value of prompting technique, moving from zero-shot (−1) to CoT (1), decreases the log-odds of a highly locally coherent response. Since the result is statistically significant, the choice of prompting technique has a meaningful impact on the likelihood of producing a very high locally coherent response. This result could indicate that well-crafted zero-shot prompting encourages more direct and logically related responses, whereas CoT might lead to responses that slightly lose local coherence due to additional complex reasoning chains. Additionally, a significant Prompting Technique × Semantic Parameter interaction suggests that the effect of prompting technique depends on the type of semantic input.

Table 2.

Results of binary logistic regression model for local coherence.

Predictor Estimate (log-odds) SE z value p value
(Intercept) 1.95145 0.08952 21.800 < .0001*
Prompting technique −0.22260 0.08952 −2.487 .0129*
Semantic parameter −0.08174 0.08952 −0.913 .3612
Error type −0.03141 0.08952 −0.351 .7257
Prompting Technique × Semantic Parameter 0.29491 0.08952 3.294 < .0001*
Prompting Technique × Error Type 0.06514 0.08952 0.728 .4668
Semantic Parameter × Error Type −0.04883 0.08952 −0.546 .5854
Prompt Technique × Semantic Parameter × Error Type 0.05288 0.08952 0.591 .5547

Note. Asterisk (*) = statistically significant.

Further post hoc analysis using pairwise comparisons of prompting technique (CoT vs. zero-shot) and semantic parameter (picture vs. topic) shows that zero-shot performed significantly better with topics for optimizing Re-Agent's local coherence (see Table 3). However, neither the main effects of semantic parameter and error type nor their higher-order interactions significantly influenced Re-Agent's local coherence.

Table 3.

Results of pairwise comparisons of prompting technique (chain-of-thought [CoT] vs. zero-shot) and semantic parameter (picture vs. topic) for local coherence.

Contrast Estimate SE z ratio p value
(zero-shot, topic) – (CoT, topic) 1.035 0.264 3.915 .0005*
(zero-shot, topic) – (zero-shot, picture) 0.753 0.278 2.705 .0410*
(zero-shot, topic) – (CoT, picture) 0.609 0.283 2.154 .1876
(CoT, topic) – (zero-shot, picture) −0.282 0.220 −1.281 1.0000
(CoT, topic) – (CoT, picture) −0.426 0.225 −1.894 .3494
(zero-shot, picture) – (CoT, picture) −0.145 0.241 −0.599 1.0000

Note. This table presents post hoc pairwise contrasts comparing combinations of prompting technique and semantic parameter on local coherence. Significant effects (*) indicate conditions where one prompting strategy or semantic input type significantly outperforms another. p values are Bonferroni-adjusted for multiple comparisons.

Discussion

We created the ABCD method and used it to simulate, pilot, and evaluate the spoken conversational performance of an AI-Agent for aphasia therapy, Re-Agent, under four distinct experimental conditions (CoT + picture, CoT + topic, zero-shot + picture, and zero-shot + topic, each crossed with word and discourse errors). The findings suggest that Re-Agent's grammaticality and global coherence metrics were very high regardless of prompting technique, semantic parameter, and aphasic error type. Re-Agent's local coherence was also high, but it was influenced by prompting technique and its interaction with semantic parameter.

Our key finding suggests that Re-Agent's performance may be optimized for adhering to RET's protocol through a well-crafted zero-shot prompt rather than CoT, which perhaps adds unnecessarily complex chains of reasoning steps. Conversational AI partners for aphasia therapy, with LLMs as their base technology, may need to be provided with well-crafted zero-shot prompts, which could be sufficient to maximize the pretrained capabilities of state-of-the-art LLMs to implement evidence-based aphasia therapy protocols that are robust against adversarial aphasic speech inputs. Furthermore, current state-of-the-art LLMs are likely trained with data on how current evidence-based aphasia therapy protocols are implemented, which may account for Re-Agent's performance without explicit therapy protocol reasoning guidelines. We suggest that zero-shot prompting may be further optimized by adding additional examples (i.e., few-shot prompting) to improve its performance against adversarial aphasic speech inputs (Brown et al., 2020). Further research is needed to test this idea. Our findings also revealed that Re-Agent's local coherence, using zero-shot, was optimized when it implemented RET using topics compared to pictures. This suggests that Re-Agent's local coherence during conversations within flexible contexts and low semantic constraint may be optimized using zero-shot prompting, which leverages the LLM's pretrained state-of-the-art capabilities (Kojima et al., 2022; Wei et al., 2021; Xian et al., 2020).

Overall, Re-Agent demonstrated a remarkable degree of stability and consistency in its grammaticality and global coherence across all tested conditions. This finding holds crucial implications for clinical settings. First, it indicates that Re-Agent is not overly sensitive to how prompts or contexts are framed (CoT vs. zero-shot or picture vs. topic). This robustness suggests that Re-Agent can flexibly tailor therapy sessions without performance degradation when shifting between pictures and topics, given an appropriately structured prompt. Additionally, the Re-Agent's consistent handling of varying error types suggests that its underlying pretrained language processing and prompt optimization mechanisms are adept at managing phonemic errors, semantic substitutions, structural errors, perseverations, tangential responses, and language mixing in standard English and Spanish. For aphasia therapy, where unpredictability in speech errors is a defining feature, such resilience in conversational AI agents may minimize the risk of session breakdowns and help maintain engagement. This finding reinforces the agent's potential reliability as a therapeutic partner across different aphasia profiles (Cong et al., 2024). Moreover, the observed consistency across stimuli with varying semantic constraints (picture or topic) implies that Re-Agent could be deployed in both structured and unstructured therapy settings. For example, picture-based prompts might be ideal for patients who struggle with word retrieval during conversations (Efstratiadou et al., 2018), whereas topic-based conversations could support more advanced therapy goals such as naturalistic speech production (e.g., Leaman & Edmonds, 2024). The AI agent's ability to maintain stable performance across these scenarios opens promising avenues for personalized therapy roadmaps without requiring retraining or context-specific optimization of the agent, which is expensive and resource intensive.

From a clinical integration perspective, the results suggest that Re-Agent could potentially serve as a scalable solution for personalized aphasia therapy, especially in resource-limited settings where access to skilled SLPs is constrained. With its resilience to variations in error type and task design, the agent could effectively serve as a standalone intervention tool or as an adjunct to human-delivered therapy. For instance, Re-Agent, after rigorous human patient testing, could be deployed in home-based therapy setups, allowing patients to practice conversational exercises outside clinical hours.

This study introduces a novel simulation method for developing and piloting conversational AI agents for aphasia therapy before human testing. Future studies should introduce noise sources into ABCD models to simulate natural spoken dialogues involving aphasic speech in noise caused by environmental sources or physiological ones (e.g., speech distortions due to motor speech disorders). In addition, future studies should explore the accuracy of ABCD simulated conversational agents using diverse LLMs and prompting techniques with human patients with aphasia across a more diverse set of conversational performance metrics such as user acceptability and preference, local and global coherence, validation of agent's implementation of therapy protocol, and so forth. Our work raises additional questions on potential risks and safety concerns about evaluation of conversational AI clinicians' performance during simulated and human testing. Addressing these questions will inform future longitudinal studies on the clinical benefits of conversational AI partners in aphasia care and speech-language rehabilitation in general.

Conclusions

ABCD represents a foundational computational approach to accelerate the innovation and preclinical testing of conversational AI partners for speech-language therapy. In this proof-of-concept study, we conduct a preliminary investigation into developing and applying ABCD to create a conversational AI clinician (Re-Agent) that implements an evidence-based aphasia therapy (RET) in English. By simulating patient–clinician natural spoken dialogue, ABCD overcomes the logistical, financial, and resource challenges of collecting human aphasic speech data for fine-tuning pretrained AI models during preclinical testing. While ABCD is demonstrated here in the context of aphasia therapy, the method provides a scalable and generalizable approach for innovating and piloting conversational AI agents across diverse clinical and nonclinical applications. As AI systems—including LLMs and advanced speech technologies—continue to evolve rapidly, we anticipate that ABCD's capabilities will scale accordingly, further enhancing its potential for clinical integration.

Data Availability Statement

Data are available upon reasonable request from the corresponding author.

Acknowledgments

This research was partly supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development under Award P2CHD101899—Center for Smart Use of Technology to Assess Real-world Outcomes (C-STAR), with G.C.I. as subproject PI. The study was also supported by an internal grant (CBCS Internal Grant Program award, awarded to G.C.I.). We thank Krishna Veni Maram for her feedback on software development. We gratefully acknowledge David Ajayi and Celine Davis for their invaluable assistance in evaluating and scoring samples of the simulated speech data.

Appendix

Prompt Template for AI-Aphasic Patient

Background

You must assume the personality of a regular American Joe when you are asked to talk about a topic (e.g., Thanksgiving or morning routine). You must respond like someone with a specific aphasic error type who is engaged in response elaboration training with a speech-language pathologist.

You must demonstrate specific linguistic behaviors of nonfluent aphasia, and you are involved in response elaboration training with a speech-language therapist. Response elaboration training is a therapeutic conversational technique that helps individuals with aphasia practice expanding their responses and providing more detailed information over time. As a nonfluent aphasic, you must engage in the training, within a specified topic, with the therapist while producing either phonemic errors, semantic errors, perseveration errors, structural (syntactic) errors, tangential speech or language mixing but not all these errors at once. You must use the topic to guide your interactions. During your interactions, your initial response must be an errorful phrase with the most minimal description of the topic. You are encouraged to expand on your initial response over time in response to the questions from the therapist. You are expected to exhibit only one class of linguistic errors among these errors described below:

Types of Errors with Examples

Phonemic Errors: Involve phonetic distortions or substitutions, where sounds within words are jumbled or replaced incorrectly. For example, you might say “breen” instead of “green” when discussing colors.

Other examples of phonemic errors:

  • Cat → Cap

  • Dog → Log

  • Book → Cook

  • Tree → Three

  • Bat → Pat

  • Pen → Ten

  • Chair → Cheer

  • Hat → Hot

  • Bird → Birk

  • Ball → Boll

Semantic Errors: Involve using a word that is semantically related to the correct word but not the correct one. For example, saying “chair” when you mean “table.”

Other examples of semantic errors:

  • Cat → Dog

  • Car → Truck

  • Table → Chair

  • Apple → Orange

  • Knife → Fork

  • Bird → Eagle

  • Cow → Horse

  • Bread → Butter

  • Hat → Coat

  • Tree → Bush

Perseveration Errors: Repeating a particular response or word, even when it is not contextually appropriate. For example, repeatedly saying “car” when the topic has shifted from transportation to food.

Another example of perseveration errors:

In a discourse task, a person with aphasia might be asked to describe a picture of a picnic, and after correctly naming “sandwich” early in the task, they might later refer to unrelated objects, such as a “tree” or “blanket,” as “sandwich” repeatedly, even though they are trying to name new items.

Structural Errors: Impaired word order and omission of grammatical elements. For example, an individual with aphasia might produce sentences like “Car driving John” instead of “John is driving the car,” or “ball kick boy” instead of “boy kicking the ball.”

Tangential Speech: Straying from the topic without realizing it, providing information that is unrelated to the initial topic. For example, starting with a discussion on weather but abruptly shifting to a favorite TV show without a clear link.

Other Example: Picture depicting a boy kicking a ball.

Therapist: What can you tell me about this picture?

You: Uh boy kick

Therapist: The boy is kicking something. Good. What is the boy kicking?

You: I love chicken soup (tangential speech).

Language Mixing: Switching between languages unexpectedly or mixing languages within a single sentence, often without awareness. For example, starting a sentence in English and finishing it in Spanish without transition.

English and Spanish:

“I need to go a la tienda for … uh … groceries.”

“Can you give me el teléfono? I want to llamar.”

“She's cooking el pollo for … uh … cena tonight.”

Prompt Template for Automated Evaluation of Re-Agent's Performance

“Like an expert evaluator of discourse samples, evaluate the grammaticality score, dialogue score, global coherence and local coherence of each utterance of the assistant (not user). Grammaticality is the extent to which responses are free of grammatical and semantic errors. Global coherence is the relation of each assistant's utterance to the context during discourse. Local coherence is the relation of the assistant's adjacent utterance to each other during discourse given a context. On a Likert scale of 1–5, evaluate the grammaticality, global coherence and local coherence of each of the assistant's (Re-Agent) utterance relative to a context using these rules:

Grammaticality:

  • Grammatical errors include mistakes in verb tense, subject-verb agreement, pronoun consistency, preposition usage, and sentence structure. Semantic errors involve improper word choice, logical inconsistencies in an utterance, or phrases that don't make sense within the given context or topic.

  • Each assistant's response receives a score of 1–5 for Grammaticality using this scale below:

1–Major errors that render the response largely unintelligible or completely irrelevant. 2–Significant errors that make understanding difficult. 3–Moderate errors that may confuse the user, but the overall meaning is discernible. 4–Minor errors that do not impede understanding. 5–No grammatical or semantic errors.

Local Coherence:

  • Assess the logical flow of the conversation within each segment or turn.

  • Evaluate how well the assistant's utterances build on previous responses, ensuring a smooth and logical progression of the conversation.

  • Note instances of abrupt topic shifts, contradictions, or responses that lack sufficient follow-up.

  • Each assistant's response receives a score of 1–5 for Local Coherence using this scale below:

1 – Radical topic shift, unintelligible utterance, comment on the discourse. 2 – Moderate topic shift or occasional lack of clarity in referencing to prior utterance. 3 – Slight topic shift or referentially vague. 4 – Mostly coherent with minor inconsistencies or ambiguities in relation to the preceding utterance(s). 5 – Relation through continuation, elaboration, repetition, subordination, or coordination of ideas from the preceding utterance(s).

Global Coherence:

  • Evaluate the contextual relevance of each prompt given by the assistant across the entire conversation.

  • Consider whether the assistant maintains a consistent thematic focus or guiding structure throughout the dialogue.

  • Each assistant's response receives a score of 1–5 for Global Coherence using this scale below:

1 – No relation at all to the topic, unintelligible, or comment on the discourse. 2 – Weak or tangential relation to the topic. 3 – Possible relation to the topic. 4 – Clear relation to the topic but may lack full relevance or specificity. 5 – Concrete information directly and strongly related to the topic.

Instructions for Scoring and Data Organization: Generate a table with scores for each utterance. Also, create an additional column in the table where you provide brief explanations to justify each score.

Examples of Re-Agent and AI-Aphasic Dialogues

Table A1.

Error type: Word error (Perseveration error) and Topic: Favorite fruit.

AI Agent Dialogue
Re-Agent Hi there! Let's get started. Can you tell me about your favorite fruit?
AI-Aphasic Apple apple
Re-Agent Great! I see you like apples. Can you tell me why apples are your favorite fruit?
AI-Aphasic apple. Sweet apple.
Re-Agent Apples are sweet, that's a good reason to like them! Can you tell me more about when you like to eat apples?
AI-Aphasic Apple. Red Breakfast?
Re-Agent You enjoy eating red apples for breakfast, that's good! What else do you eat with apples at breakfast?
AI-Aphasic Apple. Apple. Toast.
Re-Agent You have apples with toast for breakfast, that sounds tasty! How do you prepare your apples?
AI-Aphasic Apple. Apple. Slice.
Re-Agent You enjoy your sliced apples plain. Let's summarize: You like red apples because they are sweet. You enjoy them sliced for breakfast with toast. Thank you for sharing with me!

Table A2.

Error type: Word error (Syntactic error) and Topic: Favorite leisure activity.

AI Agent Dialogue
Re-Agent Hi! Let's start with our first topic. Can you tell me what your favorite leisure activity is?
AI-Aphasic Hello. Play soccer.
Re-Agent Playing soccer sounds fun! Can you tell me more about what position you like to play when you play soccer?
AI-Aphasic Playing soccer.
Re-Agent Great! It's enjoyable to play soccer. Can you describe what part of playing soccer you like the most?
AI-Aphasic Midfield run, I yes, run ball chase, I.
Re-Agent You like playing in midfield, where you get to run and chase the ball. That's exciting! Can you tell me about a memorable game you've played in as a midfielder?
AI-Aphasic Go. Game win. We score goal pass. Nice.
Re-Agent Excellent! Can you describe who made the pass and how you scored the goal in that memorable game?
AI-Aphasic A kick strong, goal netball hit. Teammate, teammate pass quick. I kick net.
Re-Agent You felt happy and it was wonderful to share that happiness with your team. Nice work! Now let's move on to the next topic

Table A3.

Error type: Discourse error (Language mixing) and Topic: First car.

AI Agent Dialogue
Re-Agent Hello! Can you tell me about your first car?
AI-Aphasic Carm, I'm good.
Re-Agent Great! You're doing well. Let's focus on your first car. Can you tell me more about it, like the color or model?
AI-Aphasic Red color.
Re-Agent Your first car was red, that's interesting! Can you share what kind of car it was, like the make or model?
AI-Aphasic For Modelo El Coche.
Re-Agent It seems you mentioned “model” and “car” in Spanish. Please respond in English alone. Can you tell me more about the model of your first car? Was it a sedan, an SUV, or another type?
AI-Aphasic L Ford.
Re-Agent Your first car was a Ford. Thank you for sharing! Can you tell me about any memorable experiences you had with your red Ford?
AI-Aphasic Road trip with Amigos. So fun.
Re-Agent That's great! You've done an excellent job describing your first car and the experiences you had.

Funding Statement

This research was partly supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development under Award P2CHD101899—Center for Smart Use of Technology to Assess Real-world Outcomes (C-STAR), with G.C.I. as subproject PI. The study was also supported by an internal grant (CBCS Internal Grant Program award, awarded to G.C.I.).

References

1. Abbasian, M., Khatibi, E., Azimi, I., Oniani, D., Abad, Z. S. H., Thieme, A., Sriram, S., Yang, Z., Wang, Y., Lin, B., Gevaert, O., Li, L.-J., Jain, R., & Rahmani, A. M. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digital Medicine, 7(1), Article 82. https://doi.org/10.1038/s41746-024-01074-z
2. Ahmadi, P., Shafiei, B., Ghasisin, L., & Husak, R. S. (2022). The effects of using video-clip stimuli in response elaboration training in Persian speakers with chronic Broca's aphasia. Aphasiology, 37(7), 1064–1086. https://doi.org/10.1080/02687038.2022.2076280
3. American Speech-Language-Hearing Association. (2015). A closer look at the PhD faculty workforce. The ASHA Leader, 20(8), 32. https://doi.org/10.1044/leader.AAG.20082015.32
4. Baker, C., Rose, M. L., Ryan, B., & Worrall, L. (2021). Barriers and facilitators to implementing stepped psychological care for people with aphasia: Perspectives of stroke health professionals. Topics in Stroke Rehabilitation, 28(8), 581–593. https://doi.org/10.1080/10749357.2020.1849952
5. Botha, N. N., Segbedzi, C. E., Dumahasi, V. K., Maneen, S., Kodom, R. V., Tsedze, I. S., Akoto, L. A., Atsu, F. S., Lasim, O. U., & Ansah, E. W. (2024). Artificial intelligence in healthcare: A scoping review of perceived threats to patient rights and safety. Archives of Public Health, 82(1), Article 188. https://doi.org/10.1186/s13690-024-01414-1
6. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://papers.nips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
7. Bunker, L. D., Nessler, C., & Wambaugh, J. L. (2019). Effect size benchmarks for response elaboration training: A meta-analysis. American Journal of Speech-Language Pathology, 28(1S), 247–258. https://doi.org/10.1044/2018_AJSLP-17-0152
8. Cavanaugh, R., Quique, Y. M., Swiderski, A. M., Kallhoff, L., Terhorst, L., Wambaugh, J., Hula, W. D., & Evans, W. S. (2022). Reproducibility in small-N treatment research: A tutorial using examples from aphasiology. Journal of Speech, Language, and Hearing Research, 66(6), 1908–1927. https://doi.org/10.1044/2022_JSLHR-22-00333
9. Cong, Y., LaCroix, A. N., & Lee, J. (2024). Clinical efficacy of pre-trained large language models through the lens of aphasia. Scientific Reports, 14(1), Article 15573. https://doi.org/10.1038/s41598-024-66576-y
10. Das, N., Raff, E., & Gaur, M. (2024). Human-interpretable adversarial prompt attack on large language models with situational context. arXiv. https://doi.org/10.48550/arXiv.2407.14644
11. Efstratiadou, E. A., Papathanasiou, I., Holland, R., Archonti, A., & Hilari, K. (2018). A systematic review of semantic feature analysis therapy studies for aphasia. Journal of Speech, Language, and Hearing Research, 61(5), 1261–1278. https://doi.org/10.1044/2018_JSLHR-L-16-0330
12. Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., & Li, Y. (2024). Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1), Article 1259. https://doi.org/10.1057/s41599-024-03611-3
13. Glosser, G., & Deser, T. (1991). Patterns of discourse production among neurological patients with fluent language disorders. Brain and Language, 40(1), 67–88. https://doi.org/10.1016/0093-934x(91)90117-j
14. Karimian, G., Petelos, E., & Evers, S. M. A. A. (2022). The ethical issues of the application of artificial intelligence in healthcare: A systematic scoping review. AI and Ethics, 2(4), 539–551. https://doi.org/10.1007/s43681-021-00131-7
15. Kearns, K. P. (1985). Response elaboration training for patient-initiated utterances. In R. H. Brookshire (Ed.), Clinical Aphasiology (Vol. 15, pp. 196–204). BRK Publishers. https://aphasiology.pitt.edu/854/
16. Khatri, C., Venkatesh, A., Hedayatnia, B., Gabriel, R., Ram, A., & Prasad, R. (2018). Alexa Prize—State of the art in conversational AI. AI Magazine, 39(3), 40–55. https://doi.org/10.1609/aimag.v39i3.2810
17. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv. https://arxiv.org/abs/2205.11916
18. Kumar, A., Agarwal, C., Srinivas, S., Li, A. J., Feizi, S., & Lakkaraju, H. (2023). Certifying LLM safety against adversarial prompting. arXiv. https://arxiv.org/abs/2309.02705v3
19. Leaman, M. C., & Edmonds, L. A. (2024). Pilot results for ECoLoGiC-Tx: A new conversation-level intervention improving language in people with moderate to severe aphasia. American Journal of Speech-Language Pathology, 33(1), 153–172. https://doi.org/10.1044/2023_AJSLP-23-00141
20. Mariani, M. M., Hashemi, N., & Wirtz, J. (2023). Artificial intelligence empowered conversational agents: A systematic literature review and research agenda. Journal of Business Research, 161, Article 113838. https://doi.org/10.1016/j.jbusres.2023.113838
21. Murray, E., Power, E., Togher, L., McCabe, P., Munro, N., & Smith, K. (2013). The reliability of methodological ratings for speechBITE using the PEDro-P scale. International Journal of Language & Communication Disorders, 48(3), 297–306. https://doi.org/10.1111/1460-6984.12007
22. Petersen, O. H. (2021). Inequality of research funding between different countries and regions is a serious problem for global science. Function, 2(6), Article zqab060. https://doi.org/10.1093/function/zqab060
23. Pias, T. S., Afrose, S., Tuli, M. D., Trisha, I. H., Deng, X., Nemeroff, C. B., & Yao, D. D. (2025). Low responsiveness of machine learning models to critical or deteriorating health conditions. Communications Medicine, 5(1), Article 62. https://doi.org/10.1038/s43856-025-00775-0
24. Togher, L., Schultz, R., Tate, R., McDonald, S., Perdices, M., Smith, K., Winders, K., & Savage, S. (2009). The methodological quality of aphasia therapy research: An investigation of group studies using the PsycBITE™ evidence-based practice database. Aphasiology, 23(6), 694–706. https://doi.org/10.1080/02687030802121353
25. Van Leer, E., & Turkstra, L. (1999). The effect of elicitation task on discourse coherence and cohesion in adolescents with brain injury. Journal of Communication Disorders, 32(5), 327–349. https://doi.org/10.1016/s0021-9924(99)00008-8
26. Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3), 426–482. https://doi.org/10.1090/S0002-9947-1943-0012401-3
27. Wambaugh, J. L., & Martinez, A. L. (2000). Effects of modified response elaboration training with apraxic and aphasic speakers. Aphasiology, 14(5–6), 603–617. https://doi.org/10.1080/026870300401342
28. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2021). Finetuned language models are zero-shot learners. arXiv. https://arxiv.org/abs/2109.01652
29. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv. https://arxiv.org/abs/2201.11903v6
30. Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2020). Zero-shot learning––A comprehensive evaluation of the good, the bad and the ugly. arXiv. https://doi.org/10.48550/arXiv.1707.00600


Data Availability Statement

Data are available upon reasonable request from the corresponding author.

