Abstract
Purpose:
Sentence repetition can contribute to the identification of developmental language disorder (DLD). However, few studies have attempted to optimize the task for clinical practice. This study uses item response theory (IRT) to optimize a Vietnamese sentence repetition task for screening and full-assessment purposes and to evaluate the diagnostic utility of the new item sets.
Method:
We expanded the original task from 28 to 40 items to maximize the chances of having robust final item sets. The 40 items were administered to 196 children in Vietnam, ages 4–6 years. Participants met criteria for DLD (n = 28) or typical development (n = 122), while a subset did not meet criteria for either classification (i.e., Risk, n = 46). Using IRT, we compared different scoring systems and selected item sets with robust parameters and adequate fit to serve two clinical purposes, assessment and screening. We calculated diagnostic accuracy of these item sets using discriminant function analysis and compared results to raw score cut-points.
Results:
The optimal item set for full assessment included 28 items (15 original items) and showed strong diagnostic accuracy, as did a 14-item subset (seven original items) designed for screening. The item set for full assessment also provided a quick characterization of children's grammatical performance. The strongest diagnostic values were derived from discriminant function analysis.
Conclusions:
This study optimized two sentence repetition tasks for monolingual Vietnamese children, one for full assessment and one for screening. Implications for utilizing the tasks in clinical practice are discussed. Future studies need to evaluate sentence repetition in older children and bilingual populations.
In recent years, numerous studies have supported the utility of sentence repetition tasks for identifying developmental language disorder (DLD). Across several distinct languages, this simple task—in which a child listens to a sentence and is asked to repeat it verbatim—demonstrates strong diagnostic accuracy for DLD. A recent scoping review (Rujas et al., 2021) found studies of sentence repetition in at least 33 languages and noted that the assessment of different language abilities has been the most common study purpose in this literature. The notable success of this task in discriminating between children with and without DLD across languages suggests that it indexes key underlying impairments in the disorder (Conti-Ramsden et al., 2001; Polišenská et al., 2015; Wang et al., 2022).
From a practical perspective, the development and dissemination of sentence repetition tasks provide clinicians with potentially valuable tools for DLD assessment. However, an initial demonstration that a sentence repetition task is effective falls short of fully optimizing and validating the task. Few published studies have considered crucial next steps in the development of sentence repetition tasks, such as determining whether task cut-points remain consistent in new samples of children or whether all items contribute effectively to the task.
The purpose of the current study is to report these critical next steps in the development of a sentence repetition task in Vietnamese that was originally introduced by G. Pham and Ebert (2020). This study optimizes the item set for two distinct clinical purposes, screening and full assessment, by supplementing the original item set and then evaluating the properties of all items using item response theory (IRT). We also replicate prior diagnostic accuracy findings in a new sample of monolingual Vietnamese-speaking children and extend findings to a younger age group (4-year-olds). This study provides a rigorous test of the Vietnamese sentence repetition task as well as a model for how validation of newly developed sentence repetition tasks can be continued beyond an initial study.
Sentence Repetition Tasks Across Languages
Sentence repetition (also called sentence imitation or sentence recall) has a long history as a measure of children's language abilities. Over the past two decades, there has been increasing recognition that the task is an excellent clinical marker of DLD (e.g., Conti-Ramsden et al., 2001; Taha et al., 2021; Wang et al., 2022; see Rujas et al., 2021, for a review). Although there is ongoing discussion of precisely which underlying abilities contribute to sentence repetition (cf. Frizelle et al., 2017; Polišenská et al., 2015; Riches, 2012), the task clearly measures a combination of skills that are core weaknesses in DLD. More specifically, sentence repetition assesses grammatical knowledge, with likely contributions from phonological memory, working memory, and lexical skills. As children with DLD have clear grammatical deficits, commonly accompanied by limitations in phonological memory, working memory, and lexical skills (Conti-Ramsden et al., 2001; Leonard, 2014), sentence repetition is ideally positioned to identify affected children. In addition, the task's practical properties—including short administration time, minimal technological requirements, and relatively easy scoring—make it particularly well suited to screen children for DLD (Christopulos & Redmond, 2023; Ebert et al., 2020). Finally, sentence repetition may also allow quick characterization of a child's knowledge of specific grammatical structures in a language (Marinis & Armon-Lotem, 2015).
Empirical demonstrations have repeatedly borne out sentence repetition's ability to identify DLD. These studies determine optimal cut-points and provide diagnostic accuracy measures (i.e., sensitivity, specificity, and positive and negative likelihood ratios) that indicate how well task performance classifies individual participants as having or not having DLD. For example, Redmond et al. (2019) demonstrated that sentence repetition yielded sensitivity and specificity values above 0.80—a conventional cutoff for adequate sensitivity or specificity (Plante & Vance, 1994)—in a sample of 251 monolingual English-speaking children, aged 5 through 9 years. Positive likelihood ratios, which indicate how much value a test score below the cut-point has for confirming DLD status, ranged between 3 and 6 in most analyses; values of 3 or higher are considered moderately informative, whereas values of 10 or higher are needed to be confirmatory (Dollaghan, 2007). Conversely, negative likelihood ratios, which indicate how much value a test score above the cut-point has for ruling out DLD, ranged between 0.12 and 0.30 in most analyses of the Redmond et al. sample; values smaller than 0.30 are moderately informative, and values smaller than 0.10 are exclusionary (Dollaghan, 2007). In other words, sentence repetition was found to be a promising task for identifying DLD using rigorous diagnostic accuracy analyses and standards in Redmond et al.'s sample. Later work extended these findings to screenings in real-world environments, including educational (Christopulos & Redmond, 2023) and clinical settings (Ebert et al., 2020).
Similar results have been found in monolingual and bilingual samples of children who speak languages other than English. Following the European initiative, Language Impairment Testing in Multilingual Settings (see Armon-Lotem et al., 2015), tasks were developed in several languages. For example, Taha et al. (2021) describe the development of a sentence repetition task in Palestinian Arabic, along with the task's ability to identify DLD in a group of 90 monolingual children aged 4 through 6 years. Across three different scoring systems, sentence repetition consistently achieved excellent sensitivity and specificity (i.e., values above 0.90; Plante & Vance, 1994). The associated positive and negative likelihood ratios were all highly informative. Taha et al. also summarize prior studies of diagnostic accuracy across languages. For Hebrew, Russian, Danish, Cantonese, Vietnamese, French, and Greek, sentence repetition tasks discriminate between children with DLD and their typically developing (TD) peers. Recently, Wang et al. (2022) added Mandarin to this list; in a sample of 69 monolingual Mandarin-speaking 4- and 5-year-olds, sentence repetition again identified DLD, with a positive likelihood ratio of 8 and a negative likelihood ratio of 0.
Task scoring is important to sentence repetition, as different scoring systems exist and may influence the diagnostic accuracy of the task. Common systems include binary scoring, target structure or core element scoring, error scoring, and proportion scoring (e.g., G. Pham & Ebert, 2020; Taha et al., 2021; Wang et al., 2022). In binary scoring, each sentence receives a 1 if it is repeated verbatim and a 0 if there are any errors in the repetition. In target structure (or core element) scoring, key grammatical elements of the sentence are identified, and only these key elements are considered in scoring; repetition errors outside of the target structure(s) are ignored. Error scoring typically involves assigning an ordinal score based on the number of errors present in the repetition. For example, in one common error scoring scheme, a perfectly repeated sentence earns a score of 2, a sentence with three or fewer errors earns a score of 1, and a sentence with four or more errors earns a score of 0 (Archibald & Joanisse, 2009; Redmond et al., 2019). Finally, proportion scoring systems involve calculating the number of errors in a repetition as a proportion of the length of the stimulus sentence (e.g., percentage of words repeated correctly).
Effective scoring systems for sentence repetition do vary across languages. For example, it is common to count errors by word in English and other European languages. However, in syllabic languages such as Mandarin and Vietnamese, it may be more appropriate to count errors by syllable (for a review of Vietnamese syllable structure, see B. Pham & McLeod, 2016). Considering errors by syllable, G. Pham and Ebert (2020) found that an error scoring system yielded the best combination of sensitivity and specificity for identifying DLD. However, binary scoring provided perfect sensitivity and was also noted to present advantages in ease of scoring. Similarly, Wang et al. (2022) found error scoring using syllables and binary scoring to be optimal scoring systems for Mandarin sentence repetition (in comparison to core element scoring, error scoring using words, or proportion of syllables correct).
Optimizing Sentence Repetition to Identify DLD
In evaluating the sentence repetition literature, it is important to note that the purpose of most studies of non-English languages has been to document a new task and provide an initial evaluation of its properties. Though this type of study is a critical first step in establishing a new measure, assessment measures require more than a single study to demonstrate their capabilities. At a minimum, replication of results in an independent sample is needed (Smith & McCarthy, 1995). In the case of sentence repetition, it is possible that diagnostic accuracy values will shift within a new sample, requiring refinement of cut-points, or that an alternate scoring system will yield superior results in the new sample.
Rigorous demonstration of the diagnostic accuracy of a measure also requires a robust one-gate sample (e.g., Dollaghan & Horner, 2011). One-gate sampling occurs when recruitment and study qualification are consistent across affected and unaffected groups (i.e., children with and without DLD). In two-gate sampling, affected and unaffected groups are recruited separately—such as when affected groups are recruited from clinical caseloads. Two-gate sampling inflates differences between affected and unaffected groups, overestimating diagnostic accuracy (Dollaghan & Horner, 2011; Pawłowska, 2014). Two-gate samples are widespread in studies of sentence repetition (e.g., Pawłowska, 2014; Taha et al., 2021; Wang et al., 2022), and replications with one-gate samples are important to confirm the task's diagnostic power. In addition, most studies establishing the diagnostic accuracy of sentence repetition for DLD have recruited small samples of children with DLD; for example, eight of nine studies summarized in Taha et al. (2021) included 16 or fewer children with DLD. The median overall sample size (i.e., combining children with and without DLD) in this same set of studies is 53, indicating that overall samples are also generally small. Replication of initial sentence repetition results in robust (i.e., large, one-gate) samples is, therefore, an important step in developing new tasks.
In addition to replication in robust samples, new assessment tasks typically benefit from a reevaluation of content after initial testing (Smith & McCarthy, 1995). Task items should be examined to ensure they contribute effectively to the task's intended purpose (e.g., discrimination between children with and without DLD or characterizing knowledge of grammatical structures). Task items that are too easy or too difficult for the population of interest add length to the task without adding value. One approach that offers particular promise in refining items in a task is IRT (Lord & Novick, 1968).
IRT for Task Optimization
Many test development efforts utilize classical test theory, which derives whole-sample measures of correct response rates and overall test scores (Anastasi & Urbina, 1997). Thus, tests developed using classical test theory tend to be limited in their generalizability, as test parameters are tied to the samples in which they were collected. If a test is validated using classical test theory in a sample of individuals who do not closely represent the population of interest, this can result in skewed or biased measurement later (Tucci et al., 2019).
As an alternative to classical test theory, IRT is based on item-level information rather than whole-sample measures. IRT evaluates the relationship between individual responses and item-level properties. This approach offers several benefits in the development and refinement of assessment measures for speech and language (see Baylor et al., 2011; Tucci et al., 2019). Importantly for our purposes, IRT can provide data about item-level error and measurement redundancy (Daub et al., 2019), allowing for the removal of problematic or redundant items to optimize test administration while maintaining a robust and reliable item set. Because IRT allows for discrete measurement of specific item parameters and sources of error at the item level, measures optimized using IRT methods can be shortened without necessarily sacrificing reliability (e.g., Yuan & Dollaghan, 2020).
In addition to task optimization, IRT offers several models that can be used to evaluate different sentence repetition scoring systems. Here, we focus on two IRT models for binary scoring (Rasch and two-parameter logistic) and one type of IRT model for error scoring (graded response). For binary scoring systems, the Rasch (also referred to as the one-parameter logistic) and two-parameter logistic (2PL) models represent item difficulty as the ability level needed to have a 50% probability of a correct item response. Test takers with higher trait-level ability (e.g., high language ability) will have a higher probability of a correct response (i.e., the item is easier for them), and test takers with lower ability will have a lower probability of a correct response (i.e., the item is more difficult for them). For example, an item with a difficulty of −1.0 would mean that an individual must have an ability of −1.0 on the same scale to have a 50% probability of passing the item. Individuals with ability levels higher than −1.0 would have a higher probability of a correct response, whereas individuals with ability levels below −1.0 would have a lower probability of passing the item. The only difference between the Rasch and 2PL models is the freeing of the item discrimination parameter. In the Rasch model, all items are assumed to have equal discrimination or ability to differentiate between different points along the ability scale. In the 2PL model, the discrimination parameter is free to vary from item to item. Testing for variation in item discrimination can be an important part of determining which items to retain or remove when using IRT for item reduction.
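To make these parameters concrete, here is a minimal sketch of the 2PL item response function in R (the software used for the analyses reported below); the Rasch model is the special case in which the discrimination parameter a is constrained to be equal across items. The values simply replay the worked example above.

```r
# Probability of a correct response under the 2PL model:
# theta = person ability, a = item discrimination, b = item difficulty.
p_2pl <- function(theta, a, b) {
  1 / (1 + exp(-a * (theta - b)))
}

p_2pl(theta = -1.0, a = 1.0, b = -1.0)  # 0.50: ability equals difficulty
p_2pl(theta =  0.0, a = 1.0, b = -1.0)  # > .50: higher-ability examinee
p_2pl(theta = -2.0, a = 1.0, b = -1.0)  # < .50: lower-ability examinee
```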
Graded response models (Samejima, 1969, 1996) are used in IRT to evaluate partial scoring systems such as error scoring commonly used in sentence repetition. Like Rasch and 2PL models, item difficulty and person ability (i.e., trait ability) are reported on the same log scale (typically ranging from −3 to 3) with a mean of 0. Difficulty and ability are determined on this scale based on responses of examinees to the items being administered. In the graded response model, items have difficulty levels corresponding to the boundaries, or thresholds, between different response categories (see Embretson & Reise, 2000, for a detailed explanation). Thus, if an item is scored using a 0, 1, 2 partial credit system, there will be threshold-level difficulty parameters for the probability of scoring a 1 instead of a 0 and for scoring a 2 instead of a 1. For example, a sentence repetition item with 0, 1, 2 scoring may have thresholds of −0.50 (to cross the threshold from scoring 0 to 1) and 0.50 (to cross the threshold from scoring 1 to 2), meaning that a test taker must have an ability level of −0.50 to have a 50% probability of scoring 1 and an ability level of 0.50 to have a 50% probability of scoring 2.
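Continuing the worked example (thresholds of −0.50 and 0.50), the sketch below shows how graded response category probabilities are formed from cumulative logistic curves; the discrimination value of 1.0 is assumed purely for illustration.

```r
# P(score >= k) follows a logistic curve at each threshold b_k;
# category probabilities are differences of adjacent cumulative curves.
grm_probs <- function(theta, a, b1, b2) {
  p_ge1 <- 1 / (1 + exp(-a * (theta - b1)))  # P(score >= 1)
  p_ge2 <- 1 / (1 + exp(-a * (theta - b2)))  # P(score = 2)
  c(p0 = 1 - p_ge1, p1 = p_ge1 - p_ge2, p2 = p_ge2)
}

grm_probs(theta = -0.50, a = 1.0, b1 = -0.50, b2 = 0.50)  # P(>= 1) = .50
grm_probs(theta =  0.50, a = 1.0, b1 = -0.50, b2 = 0.50)  # P(2) = .50
```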
Item analysis using each of these methods (Rasch, 2PL, and graded response models) provides useful information in task development and validation phases, specifically regarding which scoring system is most useful and robust. As mentioned previously, the specific scoring system that results in optimal classification accuracy for sentence repetition measures may vary based on multiple factors (see G. Pham & Ebert, 2020; Wang et al., 2022), and exploration of item parameters using IRT models of binary (Rasch and 2PL) and polytomous responses (graded response models) can inform the selection of a scoring system.
Optimizing a Vietnamese Sentence Repetition Task
In an initial study of Vietnamese sentence repetition, G. Pham and Ebert (2020) established its promise for the identification of DLD. G. Pham and Ebert described the design of a 28-item task targeting 10 grammatical forms that were (a) known to be difficult across languages (i.e., ditransitive verbs, object relative clause, subject relative clause, adverbial clauses, passives, and object complement clauses), (b) specific to Vietnamese (i.e., classifiers and aspect), and (c) considered relatively easy to serve as control items (i.e., simple intransitive, simple transitive) following guidelines by Marinis and Armon-Lotem (2015). These 10 grammatical forms were targeted by two to four items each. In a one-gate sample of 104 monolingual kindergarteners living in Vietnam, the task yielded adequate to good diagnostic accuracy values for discriminating between children with DLD (n = 10) and those without DLD (n = 94). In particular, the negative likelihood ratio was perfect (i.e., 0) when using a binary scoring system, which indicates that performance above the cut-point was sufficient to rule out DLD in the sample (i.e., confirm the absence of the disorder). This result indicates a strong potential for screening, in which the ability to quickly rule out a disorder is highly valued (e.g., Bao et al., 2024).
However, a single demonstration of diagnostic potential is insufficient. For the purpose of screening, the smallest number of task items that is effective is desirable. For the purpose of a full assessment that characterizes knowledge of different grammatical structures in a language, more items may be desirable; however, they should still accurately measure ability. To continue the process of establishing the sentence repetition task, we sought to refine the task content via IRT and to replicate diagnostic accuracy findings in an independent one-gate sample.
The current study first expanded the original 28 items from G. Pham and Ebert (2020) to maximize the chances that there would be a sufficient number of items to build both a screening and assessment measure after culling weak or problematic items. As shown in the top of Figure 1, we first rephrased one item from the original set to be more positive (i.e., changed an item translated as “The students were disciplined by the teacher” to “The students are praised by the teacher”). We removed one of the grammatical forms originally used for control (two items with simple transitive) and replaced it with a grammatical form known to be difficult for children with DLD across languages, negation (e.g., Dai et al., 2022). The revised set of 40 items consisted of 10 grammatical forms, each targeted by four items.
Figure 1.
From original items to final item sets for full assessment and screening. IRT = item response theory.
We then worked to identify the best items within the set. There is a significant need for valid, reliable, and efficient screening measures for DLD that can be utilized across clinical and linguistic contexts (see Bao et al., 2024). By their nature, screeners must be quick and efficient enough to fit into realistic timeframes for clinical implementation (Ebert et al., 2020) while still providing enough evidence to indicate whether full-scale comprehensive assessment is necessary. However, beyond efficiency, the psychometric properties of screeners and diagnostic assessments are largely the same, with a specific need and focus on diagnostic accuracy (i.e., sensitivity and specificity).
This study addressed the following research questions:
1. Does binary or error scoring yield better measurement of ability in the sample?
2. Using the best scoring system, which items in the Vietnamese sentence repetition task provide the most robust measurement of grammatical ability for monolingual Vietnamese-speaking children with and without DLD? How do children with and without DLD perform on these items?
3. Does the full set of best items have sufficient diagnostic accuracy to serve as an assessment tool?
4. Is there a subset of the best items that has sufficient diagnostic accuracy to serve as an optimized screening tool?
The overall goal in addressing these research questions is task optimization to promote high diagnostic utility for two distinct clinical purposes: full assessment and screening. Research Question 1 compares different IRT models to select a scoring system with the most reliable and informative parameters. Research Question 2 evaluates item-level properties of difficulty and discrimination for item selection. Based on the refined item set, children's grammatical performance will be compared across TD and DLD groups. Research Questions 3 and 4 evaluate the utility of item sets for different assessment purposes, considering the diagnostic accuracy of each item set.
Method
Data collection for this study occurred in Hanoi, the capital of Vietnam. The Hanoi dialect is considered the “standard” and is one of the northern regional dialects in the country (see B. Pham & McLeod, 2016, for a review). In Vietnam, the field of speech-language pathology continues to develop (Eitel et al., 2017), and speech-language assessment and treatment are not yet widely available. Kindergarten in Vietnam is part of preschool programming and is not compulsory (Vietnam National Assembly, 2005). The focus of preschool programming in Vietnam is overall child development. Classroom teachers serve as the primary facilitators of speech-language development and have been found to be reliable reporters of children's skills (G. Pham et al., 2019).
Recruitment Procedure
This study was approved by the first author's university institutional review board (HS-2021-0009). This study employed a one-gate design in which all participants were recruited from the same schools, and TD and DLD classification was based on the same reference measures. Consistent with child language studies that have employed stratified sampling (e.g., Schneider et al., 2005), we asked teachers to refer three children in their classroom in the upper academic achievement level, three in the middle academic level, and six children in the lower academic level (to overrecruit children at risk for DLD). We provided teachers with recruitment packets and asked them to send the packets home with students according to their own judgments of each child's level (i.e., high, middle, or low). Families who returned their packets were then contacted by the research team directly. If a family did not return their packet, the teacher would send the packet to a different family within the same level. Teachers did not share their initial judgments with the research team so as not to influence task administration. A total of 205 children were recruited from 20 classrooms across the two collaborating schools. We were able to reach the target total enrollment with stratified sampling because of the trusting relationship established between the research team and the schools and teachers (for information about this international collaboration, see G. Pham, 2023). Parents and teachers provided written consent to participate, and children provided verbal assent.
Participants
We recruited a total of 205 children and their parents and teachers to participate in the study. Child participants completed the Primary Test of Nonverbal Intelligence (PTONI; Ehrler & McGhee, 2008) to measure their nonverbal intelligence. Children who had a standard score of less than 70 on the PTONI were excluded from our study in accordance with international conventions for DLD classification (cf. CATALISE report; Bishop et al., 2016). Additionally, we excluded children with less than 80% exposure to Vietnamese, per parent report, to focus on monolingual development. Also, children who had no response to language sample prompts even after multiple attempts were excluded because language sampling measures were critical for DLD classification (cf. G. Pham et al., 2019). Based on this preliminary information, nine children were excluded from the study (n = 3 with bilingual exposure, n = 5 with PTONI scores below 70, n = 1 showed no response to language sampling attempts). This resulted in a sample of 196 children, aged 3;8–6;1 (years;months; M = 5.1, SD = 0.6), who completed direct language assessments.
DLD Classification
Children with DLD must demonstrate clinical deficits in language skills (Bishop et al., 2017). In the absence of a single standardized measure of Vietnamese, we established language deficits using a multidimensional approach previously used with a separate sample of Vietnamese children (G. Pham et al., 2019). DLD classification was based on a combination of teacher report, parent report, and direct measures of children's language skills. Teachers and parents completed a Vietnamese adaptation (G. Pham et al., 2019) of the Instrument to Assess Language Knowledge (ITALK; Peña et al., 2014) in which they reported on children's skills in five areas (speech, vocabulary, sentence production, grammar, and listening comprehension) using 5-point rating scales. As part of the ITALK, they also completed a yes/no question on whether they had any concern about the child's language or language learning. Parent and teacher report yielded two possible indicators of language difficulty per reporter: low ITALK mean, defined as a z score of 1 SD or more below the group mean, and a positive response to the question of language concern. A positive result from any of the four resulting indicators was considered evidence of parent concern or teacher concern regarding language.
Children completed direct language measures of vocabulary, grammar, and narratives in expressive and receptive modalities. Expressive and receptive vocabulary were indexed using picture naming and picture identification tasks that were created and validated in prior work with Vietnamese children (G. Pham et al., 2019; G. Pham & Snow, 2021). In picture naming, children saw a line drawing on a computer screen and were asked to name it. In picture identification, children saw a 2 × 2 array of line drawings, heard a word, and were asked to point to the corresponding picture. Each task consisted of 60 items (nonoverlapping) that ranged from low- to high-frequency objects in randomized order.
Grammatical and narrative measures were based on the Vietnamese version (Trinh et al., 2020) of the Multilingual Assessment Instrument for Narratives (MAIN; Gagarina et al., 2019). All four MAIN stories were used in the protocol: two stories for story retell (Cat, Dog) and two stories for story tell (Baby Birds, Baby Goats). Children completed one retell followed by one tell, counterbalancing across stories. Children's stories were audio-recorded and transcribed into Systematic Analysis of Language Transcripts (SALT) software (Miller et al., 2019). All transcripts were reviewed by a second rater for accuracy and modified c-unit segmentation (cf. Ebert & Pham, 2017). Children's retell and tell transcripts were merged into one SALT file to increase the total number of utterances available for analysis and to capture children's grammatical and narrative skills across story elicitation procedures. Responses to story comprehension questions were transcribed in SALT for later scoring, and these files were separate from the story retell/tell transcripts.
Grammatical measures included mean length of utterance (MLU) to index sentence length and percentage of grammatical utterances (PGU) to index grammaticality. MLU calculations were based on syllables, consistent with previous language sample analyses with Vietnamese-speaking children (e.g., G. Pham et al., 2019). We used a grammaticality coding system developed for Vietnamese (e.g., Dam et al., 2020) in which a single grammatical error in an utterance would make the entire utterance ungrammatical. PGU was calculated as the total number of grammatical utterances divided by the total number of utterances, multiplied by 100. Following SALT guidelines, incomplete utterances and utterances with mazes or unintelligible word(s) were excluded from the analysis. Narrative macrostructure measures were based on the MAIN manual: story score (Section A) and responses to the story comprehension questions. Story scores were calculated as the number of story elements produced (across retell and tell) divided by the total number of story elements possible. Similarly, story comprehension questions were calculated as the number of correct responses divided by the total number of questions across retell and tell samples.
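As a minimal sketch of these calculations (the counts are hypothetical stand-ins for values extracted from SALT):

```r
# Counts from one child's merged retell + tell transcript (hypothetical).
n_utterances    <- 40   # analyzable utterances (incomplete/mazed excluded)
n_grammatical   <- 32   # utterances with no grammatical errors
total_syllables <- 260  # syllables across analyzable utterances

mlu <- total_syllables / n_utterances      # MLU in syllables: 6.5
pgu <- 100 * n_grammatical / n_utterances  # PGU: 80
```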
DLD classification was based on z scores from the parent report, teacher report, and direct language measures. Recall that we recruited a higher proportion of children judged by teachers to have low academic levels (six from each classroom compared to three high and three middle levels) in order to overrecruit DLD. Had the z scores been based on the entire sample, overrecruitment in the lower range might have resulted in lower-than-expected group means and/or wider standard deviations. Thus, z scores were based on a subset of participants with equal numbers of participants representing high, middle, and low skill levels. Because teachers did not share with the research team their initial judgments (of low/medium/high academic levels), we used the subsequent ITALK scores from teachers to rank children within each classroom and then randomly selected equal numbers of high-, middle-, and low-ranking children from each classroom to form a subgroup for z-score calculation. The means and standard deviations of this subgroup (n = 156) were employed to calculate z scores for the entire sample (N = 196).
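A sketch of this norming step, assuming `x` holds one direct language measure for all 196 children and `sub` indexes the balanced subgroup of 156:

```r
# Norms come from the balanced subgroup but are applied to everyone,
# so overrecruitment of low performers does not distort the reference
# distribution used to compute z scores.
z_from_subgroup <- function(x, sub) {
  (x - mean(x[sub], na.rm = TRUE)) / sd(x[sub], na.rm = TRUE)
}
```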
We classified children as TD (n = 122, 55 boys and 67 girls) if (a) children had no indicators of parent or teacher concern and only a single direct language measure (out of six) with a corresponding z score at or below −1 or (b) children had concern by one rater (either parent or teacher), yet had average performance (i.e., within 1 SD from the mean) on all six language measures. DLD classification (n = 28, 16 boys and 12 girls) required parent and/or teacher concern and two or more direct language measures with z scores at or below −1. Consistent with multidimensional approaches to diagnosing DLD (G. Pham et al., 2019; Tomblin et al., 1996), low z scores on the direct language measures needed to span at least two different domains of vocabulary, grammar, and narratives. Some children did not meet criteria for either TD or DLD classification (i.e., Risk, n = 46, 24 boys and 22 girls). The Risk group was included in the IRT analyses but excluded from diagnostic accuracy calculations that required binary groupings (DLD or not). Table 1 displays descriptive statistics of the full sample divided into the DLD, Risk, and TD groups.
Table 1.
Participant characteristics.
| Measure | TD | Risk | DLD | Total |
|---|---|---|---|---|
| n | 122 | 46 | 28 | 196 |
| Female/male | 67/55 | 22/24 | 12/16 | 101/95 |
| Age | 5;1 (0;7) | 5;1 (0;7) | 4;11 (0;6) | 5;1 (0;7) |
| PTONI | 114.16 (17.95) | 109.20 (16.83) | 104.82 (18.80) | 111.66 (18.06) |
| Maternal education | 4.38 (0.74) | 4.37 (0.77) | 4.21 (0.74) | 4.35 (0.75) |
| ITALK—Teacher | 0.30 (0.86) | −0.83 (0.93) | −1.05 (1.06) | −0.16 (1.08) |
| ITALK—Parent | 0.25 (0.69) | −0.43 (1.35) | −0.64 (1.17) | −0.04 (1.02) |
| Picture identification | 0.34 (0.63) | −0.27 (0.84) | −1.20 (1.40) | −0.02 (0.99) |
| Picture naming | 0.24 (0.72) | −0.15 (1.12) | −0.96 (1.17) | −0.02 (0.98) |
| PGU | 0.25 (0.74) | −0.12 (0.97) | −1.20 (1.46) | −0.04 (1.05) |
| MLU | 0.25 (0.87) | −0.37 (0.86) | −1.25 (1.01) | −0.11 (1.03) |
| Story score | 0.33 (0.86) | −0.41 (1.08) | −1.18 (0.86) | −0.06 (1.06) |
| Story comprehension | 0.25 (0.77) | −0.39 (1.01) | −0.96 (1.20) | −0.07 (1.00) |
Note. The table displays the means and standard deviations (in parentheses). Age is reported as years;months. PTONI is reported as a standard score. Maternal education is reported as the highest level of education on a scale from 1 to 6 (1 = less than high school; 2 = high school; 3 = some college; 4 = college; 5 = master's; 6 = doctorate). Remaining variables are reported as z scores. As described in the text, DLD classification was based on parent and teacher report (e.g., low ITALK means or a positive response to the concern question) and six direct language measures (the last six measures displayed here). TD = typically developing; DLD = developmental language disorder; PTONI = Primary Test of Nonverbal Intelligence; ITALK = Instrument to Assess Language Knowledge; PGU = percentage of grammatical utterances; MLU = mean length of utterance.
Task Stimuli and Scoring Systems
In the sentence repetition task, children heard an audio-recorded sentence and were asked to repeat it verbatim. Each of the following grammatical forms was targeted with four items: intransitive verbs, ditransitive verbs, aspect, passives, classifiers, adverbial clauses, object complement clauses, object relative clauses, subject relative clauses, and negation.
We utilized two systems to score the sentence repetition task: binary, in which sentences received a score of 1 if repeated perfectly and a score of 0 if one or more errors were present, and error, in which sentences received a score of 2 if repeated perfectly, 1 if one to three errors were present, and 0 if four or more errors were present. Scoring was completed by trained research assistants fluent in Vietnamese. Interrater reliability was assessed on 100% of the samples to ensure scoring accuracy. Two raters first scored each item independently. They then discussed discrepancies in their scoring and updated the score once they reached consensus; discrepancies for which no consensus could be reached were treated as true disagreements. In the initial (independent) scoring, point-by-point reliability between the first and second raters was 90% for binary scoring and 93% for error scoring. After discrepancies were discussed, point-by-point reliability increased to 92% and 95% for binary and error scoring, respectively. For unresolved discrepancies, we relied on the scores completed by the first rater.
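A minimal sketch of the two scoring systems, applied to per-item counts of syllable-level errors (the function and variable names are ours, for illustration only):

```r
score_binary <- function(n_errors) as.integer(n_errors == 0)

score_error <- function(n_errors) {
  ifelse(n_errors == 0, 2L,         # verbatim repetition
         ifelse(n_errors <= 3, 1L,  # one to three errors
                0L))                # four or more errors
}

errors <- c(0, 2, 5, 1)  # hypothetical error counts for four items
score_binary(errors)     # 1 0 0 0
score_error(errors)      # 2 1 0 1
```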
Analyses
All analyses were completed using R statistical software (R Core Team, 2023). To address Research Questions 1 and 2, IRT analyses were conducted using the mirt package (Chalmers, 2012). We examined four core indicators—unidimensionality, model fit, item parameters, and item fit—in a stepwise process to determine which items had strong parameters for retention. Unidimensionality is evidence that each item is measuring a single latent construct (i.e., language ability). For each IRT model that was fit, we used the factor analytic argument in the mirt package to yield factor loadings for each item and total variance explained. To retain only the items representing the strongest measurement of the underlying construct, we set a conservative cutoff of factor loadings of 0.6 or greater. Given that loadings of 0.6 or greater indicate a strong correlation with the latent variable (DeVellis, 2017), we believed that this cutoff would yield a subset of items that best represented language ability.
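The sketch below shows how this step might look with the mirt package; `resp_err` stands for the 196 × 40 matrix of 0/1/2 item scores and is an assumed object, not the study's actual analysis script.

```r
library(mirt)

# Fit a unidimensional graded response model to the error-scored data.
fit_grm <- mirt(resp_err, model = 1, itemtype = "graded")

# Standardized factor loadings on the single latent factor; retain
# items loading at or above the .60 cutoff adopted in this study.
loads <- summary(fit_grm, verbose = FALSE)$rotF[, 1]
keep  <- names(loads)[loads >= 0.60]
```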
We then examined model fit to determine which scoring system best fit the data: binary scoring with equal discrimination (Rasch), binary scoring with varied discrimination (2PL), or error scoring (graded response models [GRMs]). As mentioned previously, both Rasch and 2PL models are used with binary response data; however, the Rasch model sets discrimination to be equal across all items, whereas discrimination is free to vary in the 2PL model. In the 2PL and GRM models, issues with item discrimination (i.e., discrimination < 1.0) were also considered as reasons for removal of items from the set. Model and item fit were determined using chi-square tests (χ²) for exact fit and root-mean-square error of approximation (RMSEA) and comparative fit index (CFI) for approximate fit. A significant χ² result indicates poor model fit, while a nonsignificant result indicates good fit. RMSEA values < 0.05 and CFI values > 0.90 are considered indicative of good approximate model and item fit.
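A sketch of how the three models could be fit and compared in mirt, under the same assumptions as above (`resp_bin` holds the binary scores; `resp_err` holds the 0/1/2 error scores):

```r
library(mirt)

fit_rasch <- mirt(resp_bin, model = 1, itemtype = "Rasch")   # equal a
fit_2pl   <- mirt(resp_bin, model = 1, itemtype = "2PL")     # free a
fit_grm   <- mirt(resp_err, model = 1, itemtype = "graded")  # error scoring

M2(fit_grm)                # approximate fit indices (RMSEA, CFI)
anova(fit_rasch, fit_2pl)  # nested comparison of the binary models
itemfit(fit_grm)           # item-level fit statistics
```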
For the error scoring system (0, 1, 2), we fit GRMs for the whole sample and for each group (DLD and TD). Multiple-group models were fit to ensure items being retained demonstrated expected patterns of difficulty for groups with and without DLD, as items should be more difficult for children with DLD than for TD children. GRMs provide difficulty parameters at each scoring threshold: In the case of the sentence repetition items, this means the difficulty of scoring 1 over 0 and of scoring 2 over 1.
For both binary scoring models and the GRM fit with the whole sample, all participants were included: those with DLD, those classified as TD, and those in the Risk group. Although participants in the Risk group could not be assigned a definitive diagnostic status, their item performance reflects that of children likely to be encountered in the clinical population, and thus, their data were retained for the purpose of estimating item parameters and model fit. The multigroup models were fit using only the scoring data from participants who clearly met criteria for either DLD or TD, to maximize our ability to make item retention decisions based on the relative performance of our definitive groups. The Risk group was not included in multigroup modeling because these children could belong to either the DLD or TD group; without sufficient information to categorize their ability, their relative item performance is less informative for item retention decisions.
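A sketch of the multigroup step, assuming `grp` is a vector of confirmed classifications ("DLD" or "TD") aligned with the rows of `resp_err`:

```r
library(mirt)

confirmed <- grp %in% c("DLD", "TD")  # Risk group excluded here
fit_mg <- multipleGroup(resp_err[confirmed, ], model = 1,
                        group = grp[confirmed], itemtype = "graded")

# Per-group discrimination (a) and threshold difficulties (b1, b2);
# retained items should be more difficult for the DLD group.
coef(fit_mg, IRTpars = TRUE, simplify = TRUE)
```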
The model best fitting the data was used to derive item parameters and item fit statistics. We considered the amount of information each item yielded at various points along the ability scale, using item information functions as an empirical indicator of each item's relative contribution to ability estimates (Hambleton et al., 1991). Because our goal was to differentiate between children with and without DLD, we wanted the greatest amount of measurement information clustered around the average-to-below-average range, allowing ample opportunity to differentiate between children who do and do not have DLD. Items yielding information significantly outside of this range of ability (e.g., a very easy item with a difficulty < −3.99) were considered for removal from the set.
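A sketch of inspecting item information with mirt utilities, continuing from the assumed `fit_grm` object above:

```r
library(mirt)

theta <- matrix(seq(-4, 4, by = 0.1))  # grid over the ability scale
info1 <- iteminfo(extract.item(fit_grm, 1), Theta = theta)

# Items whose information peaks far outside the average-to-below-average
# range (e.g., difficulty < -3.99) add length without aiding separation.
plot(as.vector(theta), info1, type = "l",
     xlab = "Ability (theta)", ylab = "Item information")
```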
When the final assessment item set (i.e., the largest set of robust items) was derived, we then selected the smallest subset of items with the highest factor loadings for the TD group to serve as the screening measure. We used the factor loadings for the TD group as the basis for selection to ensure that the selected items represented the strongest measurement of the underlying language ability construct for TD children. This makes it more likely that the performance of children with DLD will diverge measurably from that of the TD group.
Research Questions 3 and 4 were addressed with discriminant function analysis using the candisc package (Friendly & Fox, 2021). Discriminant analysis assigns weighted scores to each item, which are then summed to assign an overall discriminant score to each examinee. We found the optimal cut-point for discriminant scores by finding the midpoint between the highest discriminant score for an examinee with previously identified DLD and the lowest discriminant score for an examinee with typical language ability as indicated by previous testing.
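A sketch of this procedure with the candisc package; `dat` (item scores plus a confirmed TD/DLD `group` factor) and `item_cols` are assumed objects, not the study's actual data.

```r
library(candisc)

# Canonical discriminant analysis of the retained items by group.
mod <- lm(as.matrix(dat[, item_cols]) ~ group, data = dat)
can <- candisc(mod)
dat$dscore <- can$scores$Can1

# Cut-point: midpoint between the highest-scoring DLD examinee and the
# lowest-scoring TD examinee (the orientation of Can1 can vary between
# fits, so check group means before applying).
cut_point <- (max(dat$dscore[dat$group == "DLD"]) +
              min(dat$dscore[dat$group == "TD"])) / 2
```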
Finally, cut-score optimization was completed using the OptimalCutpoints package (Lopez-Raton et al., 2014) to determine the diagnostic accuracy of the measures using raw scores. While discriminant analysis is a robust empirical method of determining diagnostic accuracy, the calculations involved in determining and applying discriminant score cut-points may serve as a barrier to widespread clinical implementation. For this reason, we sought to determine the optimal cut-score using raw scoring data obtained using the default scoring systems of the sentence repetition items.
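A sketch of the raw-score analysis, under the same assumptions (`dat` restricted to confirmed TD/DLD cases, with `raw_total` holding total points):

```r
library(OptimalCutpoints)

# Optimal raw-score cut-point via the Youden index, as in Table 4.
oc <- optimal.cutpoints(X = "raw_total", status = "group",
                        tag.healthy = "TD", methods = "Youden",
                        data = dat)
summary(oc)  # optimal cut-score with its sensitivity and specificity
```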
Both the discriminant function analysis and the cut-score optimization analysis were completed using only the scoring data from participants clearly classified as either TD or DLD. Participants in the Risk group were excluded from diagnostic accuracy calculations as their diagnostic status could not be confirmed a priori.
Results
Unidimensionality: All Models
Table 2 contains factor loadings and total variance explained for each item in each model that was fit. The Rasch model using the whole sample of participants indicated that all items likely meet the unidimensionality assumption (i.e., loadings > 0.6) and thus are measuring one construct. The proportion of variance explained by the Rasch model was estimated at 0.506. However, because all items are assumed to be equally discriminating in the Rasch model, all factor loadings are fixed to be equal as well. Thus, we also examined loadings in the 2PL and graded response models, in which discrimination and loadings were free to vary.
Table 2.
Factor loadings and model fit indices.
| Grammatical form | Item | Rasch: Whole sample | 2PL: Whole sample | 2PL: DLD | 2PL: TD | GRM: Whole sample | GRM: DLD | GRM: TD |
|---|---|---|---|---|---|---|---|---|
| Simple intransitive sentence | **1** | 0.712 | 0.698 | 0.917 | *0.593* | 0.733 | 0.695 | 0.695 |
| | 2 | 0.712 | *0.549* | 0.826 | *0.411* | 0.628 | 0.730 | *0.441* |
| | **3** | 0.712 | 0.732 | 0.920 | 0.677 | 0.739 | 0.852 | 0.675 |
| | **4** | 0.712 | 0.773 | 0.847 | 0.709 | 0.740 | 0.740 | 0.703 |
| Ditransitive verb | 5 | 0.712 | 0.636 | 0.902 | *0.490* | 0.636 | 0.733 | *0.488* |
| | 6 | 0.712 | 0.635 | *0.482* | *0.553* | 0.624 | *0.585* | *0.543* |
| | **7** | 0.712 | 0.744 | 0.920 | 0.625 | 0.775 | 0.932 | 0.633 |
| | **8** | 0.712 | 0.702 | 0.970 | *0.581* | 0.753 | 0.980 | 0.636 |
| Aspect | **9** | 0.712 | 0.847 | 0.870 | 0.787 | 0.823 | 0.839 | 0.761 |
| | **10** | 0.712 | 0.644 | 0.729 | *0.591* | 0.700 | 0.690 | 0.731 |
| | **11** | 0.712 | 0.673 | 0.893 | *0.545* | 0.699 | 0.902 | 0.620 |
| | 12 | 0.712 | 0.709 | 0.945 | *0.487* | 0.687 | 0.783 | *0.499* |
| Passive | **13** | 0.712 | 0.701 | 0.953 | 0.702 | 0.751 | 0.870 | 0.727 |
| | **14** | 0.712 | 0.844 | 0.886 | 0.774 | 0.795 | 0.741 | 0.730 |
| | **15** | 0.712 | 0.750 | 0.910 | 0.683 | 0.760 | 0.829 | 0.694 |
| | 16 | 0.712 | 0.670 | 0.921 | *0.476* | 0.722 | 0.924 | *0.470* |
| Classifier | **17** | 0.712 | 0.823 | 0.998 | 0.732 | 0.835 | 0.986 | 0.744 |
| | **18** | 0.712 | 0.671 | 0.870 | *0.591* | 0.741 | 0.881 | 0.638 |
| | **19** | 0.712 | 0.757 | 0.969 | 0.660 | 0.792 | 0.946 | 0.649 |
| | **20** | 0.712 | 0.733 | 0.982 | *0.586* | 0.738 | 0.868 | 0.614 |
| Adverbial clause | **21** | 0.712 | 0.747 | 0.999 | 0.708 | 0.751 | 0.824 | 0.727 |
| | **22** | 0.712 | 0.637 | 0.701 | *0.554* | 0.679 | 0.660 | 0.615 |
| | 23 | 0.712 | 0.647 | 0.611 | 0.652 | 0.709 | *0.577* | 0.733 |
| | **24** | 0.712 | 0.765 | 0.940 | 0.630 | 0.775 | 0.784 | 0.677 |
| Object complement clause | **25** | 0.712 | 0.814 | 0.693 | 0.819 | 0.798 | 0.714 | 0.794 |
| | **26** | 0.712 | 0.753 | 0.973 | 0.652 | 0.717 | 0.774 | 0.625 |
| | **27** | 0.712 | 0.823 | 0.975 | 0.702 | 0.811 | 0.976 | 0.721 |
| | **28** | 0.712 | 0.812 | 0.740 | 0.776 | 0.773 | 0.808 | 0.784 |
| Relative clause: Object | **29** | 0.712 | 0.800 | 0.911 | 0.646 | 0.764 | 0.790 | 0.651 |
| | 30 | 0.712 | *0.436* | 0.776 | *0.253* | *0.546* | 0.762 | *0.291* |
| | 31 | 0.712 | *0.599* | 0.843 | *0.450* | 0.652 | 0.750 | *0.555* |
| | **32** | 0.712 | 0.717 | 0.820 | 0.662 | 0.735 | 0.782 | 0.702 |
| Relative clause: Subject | **33** | 0.712 | 0.816 | 0.998 | 0.764 | 0.783 | 0.916 | 0.708 |
| | **34** | 0.712 | 0.759 | 0.816 | 0.678 | 0.810 | 0.896 | 0.731 |
| | **35** | 0.712 | 0.706 | 0.603 | *0.569* | 0.719 | 0.838 | 0.643 |
| | **36** | 0.712 | 0.785 | 0.939 | 0.765 | 0.779 | 0.763 | 0.760 |
| Negation | **37** | 0.712 | 0.629 | 0.899 | *0.541* | 0.689 | 0.788 | 0.648 |
| | **38** | 0.712 | 0.712 | 0.902 | 0.627 | 0.701 | 0.849 | 0.636 |
| | **39** | 0.712 | 0.682 | 0.745 | 0.629 | 0.748 | 0.769 | 0.714 |
| | **40** | 0.712 | 0.718 | 0.900 | 0.776 | 0.746 | 0.720 | 0.799 |
| Variance explained | | 0.506 | 0.520 | — | — | **0.542** | — | — |
| Model fit indices | | | | | | | | |
| χ² | | 279.57 | 260.75 | — | — | **478.47** | — | — |
| df | | 275 | 252 | — | — | **432** | — | — |
| p | | .412 | .339 | — | — | **.06** | — | — |
| RMSEA | | 0.01 | 0.01 | — | — | **0.02** | — | — |
| CFI | | 0.99 | 0.99 | — | — | **0.99** | — | — |
Note. Italicized values are below the 0.60 factor loading threshold for retention (i.e., violation of unidimensionality). Bolded item numbers indicate items that met unidimensionality in the graded response models (GRMs). Bolded values in the variance-explained and fit-index rows mark the model with the highest variance explained for all participants (the whole-sample GRM) and its corresponding fit indices. Fit indices are not provided for multigroup models, as convergence issues arising from fitting complex item response theory models with small sample sizes make them uninformative for our purposes. 2PL = two-parameter logistic; DLD = developmental language disorder; TD = typically developing; RMSEA = root-mean-square error of approximation; CFI = comparative fit index.
The whole-sample 2PL model indicated that three items (see Table 2, italicized values) had loadings below the 0.6 threshold for strong evidence of unidimensionality. The proportion of variance explained by the whole-sample 2PL model was estimated at 0.520. In the multigroup 2PL model, one item for the DLD group and 16 items for the TD group (see Table 2) had loadings below the 0.6 threshold.
In the whole-sample GRM, one item had a loading below the 0.6 threshold. The proportion of variance explained by the whole-sample GRM was estimated at 0.542, indicating that the error scoring system explained the most variance in language ability of the three whole-sample models tested. In the multigroup GRM, two items for the DLD group and seven items for the TD group had loadings below the 0.6 threshold.
To summarize, 16 items violated the unidimensionality assumption in the 2PL models, while only eight items violated unidimensionality in the GRMs. This leaves 24 items in binary scoring or 32 items in error scoring to carry forward into the next step of the analysis. In the next step, model fit and item parameters are considered for whole-sample models only given that multigroup models may not converge for small sample sizes, particularly for the DLD group (n = 28).
Binary Scoring: Model Fit and Item Parameters
As shown in Table 2, approximate model fit statistics for the Rasch model indicated acceptable fit: RMSEA = 0.01, CFI = 0.99. Exact model fit statistics indicated that the Rasch model with the 24 items meeting the unidimensionality assumption has acceptable fit for these data: χ²(275) = 279.57, p = .412. Individual item fit statistics in the Rasch model can be found in Supplemental Material S1. Five items had unacceptable approximate fit, and two had unacceptable exact fit. Supplemental Material S2 contains item parameters for the Rasch model fit using the whole sample of participants and the 24 items meeting the unidimensionality assumption. One item had inflated difficulty (3.58) and thus yielded maximum information outside of the target ability range.
In the whole-sample 2PL model, approximate model fit statistics indicated acceptable fit: RMSEA = 0.01, CFI = 0.99 (see Table 2). Similar to the Rasch model, exact model fit statistics indicated that the 2PL model with 24 unidimensional items has acceptable fit for these data: χ²(252) = 260.75, p = .339. Twenty-two of the 24 items in the whole-sample 2PL model had acceptable fit (see Supplemental Material S1). See Supplemental Material S2 for the item parameters for the whole-sample 2PL model. Most items yielded maximum information within the target range of ability, but the scale overall yielded maximum information around the midpoint of the ability scale (−0.50 to 0.50), with less information on the lower end of the ability scale where children with higher degrees of language difficulty may fall.
Error Scoring: Model Fit and Item Parameters
Approximate fit statistics indicated acceptable fit for the whole-sample GRM (see Table 2): RMSEA = 0.02, CFI = 0.99. Exact model fit statistics also indicated that the GRM with 32 unidimensional items is a good fit for these data: χ²(432) = 478.47, p = .06. This suggests that error scoring has acceptable fit for the data. Item fit statistics for the whole-sample GRM indicated acceptable fit for 28 of the 32 items in the set (see Supplemental Material S3). All items yielded maximum information within the target range of ability (see Supplemental Material S4).
Summary of Model Fit, Item Fit, and Parameters
Given the level of variance explained and the overall scale information, the GRM is the best-fitting model for these data. Addressing Research Question 1, this indicates that error scoring should be the scoring system used with this set of items. Of the 40 items initially tested, 32 met the unidimensionality assumption in the GRM analysis. Four of these items had poor fit at the item level for the whole sample and were removed from the next step of the analysis. Addressing Research Question 2, 28 items had acceptable fit in the whole-sample GRM and robust item parameters (see Figure 1). We carried forward these 28 items to build both the optimized assessment measure and the optimized screening measure and to test diagnostic accuracy.
Table 3 contains the 28 items that were carried forward to create the assessment and screening measures. Figure 2 displays average item scores, grouped by target grammatical form, across TD, Risk, and DLD groups. Two points is the maximum score per item. On average, the TD group showed high accuracy (defined here as > 1.5 points out of 2) on object complement, classifier, passive, aspect, ditransitive, and simple intransitive items and relatively lower accuracy (defined as 1–1.5 points) on negation, relative subject, relative object, and adverbial clause items.
Table 3.
Composition of 28-item assessment.
| Item | Grammatical form | Vietnamese sentence | Approximate translation to English |
|---|---|---|---|
| 1 | Simple intransitive | Các bạn vui vẻ chạy đến trường học.* | The friends happily ran to school. |
| 3 | Simple intransitive | Con mèo trắng nằm ngủ ở trên cỏ. | The white cat sleeps on the grass. |
| **4** | Simple intransitive | Con chim màu xanh bay lên cành cây. | The blue bird flies onto the tree branch. |
| 7 | Ditransitive | Bà kể cho em một câu chuyện vui. | Grandma tells me a happy story. |
| 8 | Ditransitive | Bà đưa một bình sữa cho em bé. | Grandma gives the baby a bottle of milk. |
| **9** | Aspect | Nhiều lá trên cây đang rơi xuống đất.* | Many leaves on the tree are falling to the ground. |
| **10** | Aspect | Ông nội đã sơn xong cái nhà rồi.* | Grandfather has already finished painting the house. |
| 11 | Aspect | Hai chị em đang ở nhà bạn chơi.* | The two sisters are playing at their friend's house. |
| **13** | Passive | Các em học sinh được cô giáo khen. | The students are praised by the teacher. |
| **14** | Passive | Em bé được mẹ đưa về quê chơi.* | The baby is brought by mother to the countryside. |
| **17** | Classifier | Chị chọn cho em một cái áo vàng.* | Big sister chose for me one yellow shirt. |
| 18 | Classifier | Bà mua cho con bốn đôi giày mới.* | Grandma buys for me four pairs of new shoes. |
| 19 | Classifier | Năm con vịt nhỏ đang bơi dưới nước.* | Five small ducks are swimming in the water. |
| 22 | Adverbial clause | Chị lo lắng vì em bị đau chân.* | Big sister worries because I hurt my leg. |
| 24 | Adverbial clause | Tuy/Dù em ốm/bệnh, em vẫn phải đi học.* | Though I am sick, I still have to go to school. |
| **25** | Object complement | Con chó nhỏ biết cô chủ sắp về.* | The small dog knows that his owner is almost home. |
| 26 | Object complement | Cô giáo cho biết mùa xuân đến rồi.* | The teacher announces that spring is here. |
| **27** | Object complement | Em muốn con thỏ của em mau lớn. | I want my bunny to grow up quickly. |
| 29 | Relative object | Em thích cái hình chị vẽ con thỏ.* | I like the picture that she drew of the rabbit. |
| **32** | Relative object | Cô chụp ảnh/hình em chơi với các bạn. | Teacher took a picture of me playing with friends. |
| **33** | Relative subject | Cà chua ông trồng trong vườn chín rồi.* | The tomatoes grandfather planted in the garden are now ripe. |
| **34** | Relative subject | Con chó em mua hôm qua đang ngủ.* | The dog I bought yesterday is sleeping. |
| 35 | Relative subject | Con gà em nuôi đẻ ba quả/trái trứng. | The chicken I raised laid three eggs. |
| **36** | Relative subject | Cái áo mẹ mua có hình con gấu. | The shirt mother bought has a picture of a bear. |
| 37 | Negation | Con chó của em ở nhà không sủa. | My dog at home does not bark. |
| 38 | Negation | Sáng nay cô giáo không mang theo ô/dù. | This morning, teacher did not bring the umbrella. |
| **39** | Negation | Cây cam ở trong vườn chưa ra trái. | The orange tree in the garden has not yet borne fruit. |
| **40** | Negation | Ông bà chưa cho con tiền lì xì. | The grandparents have not yet given me lucky money. |
Note. Bolded item numbers indicate the 14 items included in the screener. Words that are preceded by a slash are alternates to use for the southern dialect.
*Items from the original item set (G. Pham & Ebert, 2020).
Figure 2.
Average raw score (2 maximum) for grammatical targets by DLD, Risk, and TD groups. DLD = developmental language disorder; TD = typically developing.
As expected, the DLD group showed lower accuracy than the TD group across all grammatical forms. On average, the DLD group showed relatively high accuracy (> 1 point) on negation, classifier, passive, aspect, ditransitive, and simple intransitive and low accuracy (< 1 point) on relative subject, relative object, object complement, and adverbial clause. Performance of the Risk group seemed to fall in between the TD and DLD groups.
Discriminant Function Analysis and Predictive Values
To address Research Question 3 for the full assessment task, we tested the diagnostic accuracy of the 28 retained items using discriminant function analysis. For diagnostic accuracy in all discriminant analyses, we used Plante and Vance's (1994) 80% cutoff for acceptable sensitivity and specificity. Table 4 contains the diagnostic accuracy of both the 28- and 14-item versions. The 28-item version of the assessment demonstrated acceptable sensitivity (85.71%) and specificity (87.70%) at a discriminant score cutoff of 2.738.
Table 4.
Diagnostic accuracy of assessment and screener using raw scores and discriminant scores.
| Metric | 28-item assessment: discriminant score | 28-item assessment: raw score | 14-item screener: discriminant score | 14-item screener: raw score |
|---|---|---|---|---|
| Cut-point | 2.738 | 39 | 2.779 | 19 |
| Sensitivity | 85.71% | 71.43% | 82.14% | 75.00% |
| Specificity | 87.70% | 76.23% | 82.79% | 72.95% |
| Positive predictive value | 61.53% | 32.67% | 52.27% | 38.89% |
| Negative predictive value | 96.40% | 92.08% | 95.28% | 92.71% |
| Positive likelihood ratio | 6.97 | 3.00 | 4.77 | 2.77 |
| Negative likelihood ratio | 0.16 | 0.37 | 0.22 | 0.34 |
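For readers who wish to reproduce this type of analysis, the following R sketch shows one way to derive discriminant scores from item-level data and to evaluate them against a fixed cut-point. It is a minimal illustration rather than the study's analysis script: the data frame `dat`, its column names, and the direction of the cutoff comparison are all assumptions.

```r
# Minimal sketch, assuming a data frame `dat` with a factor column
# `group` ("DLD"/"TD") and one numeric column per item score.
library(MASS)

fit <- lda(group ~ ., data = dat)            # linear discriminant function
ld1 <- predict(fit)$x[, 1]                   # one discriminant score per child

cutoff <- 2.738                              # assessment cut-point from Table 4
pred <- ifelse(ld1 >= cutoff, "DLD", "TD")   # score direction is an assumption

tab <- table(truth = dat$group, predicted = pred)
sensitivity <- tab["DLD", "DLD"] / sum(tab["DLD", ])  # true positives / all DLD
specificity <- tab["TD", "TD"] / sum(tab["TD", ])     # true negatives / all TD
```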
We also sought to identify a smaller set of items that could form an effective screening tool (Research Question 4). To address this purpose, we selected the 14 items from the 28-item version that had the highest factor loadings for the TD group (see Table 2 and Figure 1). As a reminder, this ensures that the screener is composed of items representing the strongest measurement of language ability for TD children, making it likely that children with DLD will demonstrate differential performance that allows for identification. As shown in Table 4, this screener demonstrated acceptable sensitivity (82.14%) and specificity (82.79%) when using a discriminant score cutoff of 2.779. These levels of sensitivity and specificity translate to positive and negative predictive values of 52% and 95%, respectively. The positive predictive value (52%) indicates that a failed screening result likely means that further assessment is necessary. The negative predictive value (95%) indicates that a passing screening result suggests that the child does not need further assessment and likely has typical language abilities. Positive and negative likelihood ratios were 4.77 and 0.22, respectively, indicating informative values for ruling in and ruling out DLD.
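This item selection step can be sketched in R as follows, under stated assumptions: `resp_td` is a hypothetical response matrix for the TD group, and we assume the mirt package (Chalmers, 2012), cited in this article, exposes standardized loadings via extract.mirt().

```r
# Hedged sketch: choose the 14 items with the highest factor loadings
# from a unidimensional graded response model fit to TD children only.
library(mirt)

fit_td <- mirt(resp_td, model = 1, itemtype = "graded")  # `resp_td` is hypothetical
loadings <- extract.mirt(fit_td, "F")[, 1]               # standardized loadings (F1)
screener_items <- names(sort(loadings, decreasing = TRUE))[1:14]
screener_items                                           # candidate screener set
```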
In addition to discriminant function analysis, we examined the sensitivity and specificity of the 28-item assessment and the 14-item screener when total raw scores are used. Using the OptimalCutpoints package in R with the Youden index, we identified the optimal raw-score cut-point for each item set. Table 4 contains the results of this analysis. For both item sets, using a total-score cut-point results in sensitivity and specificity below acceptable levels (i.e., < 80%), suggesting that the use of discriminant scores for classification is the most robust method of identifying DLD with these tools.
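As an illustration of this raw-score analysis, the call below mirrors the described use of the OptimalCutpoints package with the Youden index; the data frame and its column names are assumptions.

```r
# Sketch of selecting a raw-score cut-point via the Youden index,
# assuming `dat` has a total raw score and a group label per child.
library(OptimalCutpoints)

oc <- optimal.cutpoints(X = "raw_total", status = "group",
                        tag.healthy = "TD", methods = "Youden",
                        data = dat)
summary(oc)  # reports the optimal cut-point with sensitivity and specificity
```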
Discussion
The present study reported steps in optimizing a sentence repetition task using IRT. It provides both a set of rigorously validated Vietnamese sentence repetition tasks (for screening and for full assessment) and an example of how initial task validation should be extended. The rigorous design and methodology provide a prototype for studies of DLD assessment. First, we employed a one-gate sampling design in which all child participants were recruited from the same schools and classified using the same reference measures. One-gate sampling avoids some of the pitfalls of two-gate sampling, such as the polarization of skills between TD and DLD groups and the inability to detect mild cases of DLD (Pawłowska, 2014). Second, in comparison to most studies of DLD, this study had a large sample overall (N = 196) and a large sample of children with DLD (n = 28). Larger samples allow for a wider representation of the target population as well as more sophisticated statistical approaches, such as the IRT analyses utilized in the current study. Third, the present study replicated and extended results from G. Pham and Ebert (2020) using an independent sample and a larger item set. Replication with an independent sample is a rigorous test of the diagnostic utility of the sentence repetition task for Vietnamese populations. The larger item set allowed for content reanalysis and set the stage for task optimization (see Figure 1).
Using multiple IRT models, the present study aimed to find the best balance among (a) robust item parameters, (b) diagnostic accuracy, and (c) items measuring all grammatical constructs. Here, we highlight two main results from the IRT analyses. First, factor analytic results (i.e., variance explained; see Table 2) and model fit statistics provide empirical evidence that error scoring, modeled with the GRM, is better for measuring performance on this task. Even though binary scoring may be easier and quicker, error scoring provided the best discrimination between TD and DLD groups. Allowing children to receive partial credit on sentence repetition items, as in the error scoring system, allows clinicians to make more nuanced measurements of language ability than are possible with a binary all-or-nothing system. This can, in turn, improve diagnostic decision making because clinicians have more evidence with which to determine the presence or absence of a disorder such as DLD.
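As a deliberately simplified sketch of this modeling step, the code below fits a GRM to error-scored responses with the mirt package and inspects model and item fit; the response matrix and the specific fit indices shown are assumptions rather than a record of the study's exact calls.

```r
# Hedged sketch: fit a graded response model to error-scored items
# (0 = incorrect, 1 = partial credit, 2 = correct) and check fit.
library(mirt)

fit_grm <- mirt(resp_graded, model = 1, itemtype = "graded")  # `resp_graded` is hypothetical

M2(fit_grm)       # overall model fit statistics (M2, RMSEA, etc.)
itemfit(fit_grm)  # per-item fit (S-X2 by default)
coef(fit_grm, simplify = TRUE, IRTpars = TRUE)$items  # discrimination and thresholds
```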
Second, we selected 28 items for the full test and 14 items for the screener. Using empirical evidence in the form of factor loadings, item fit indices, and item parameters, we systematically derived item sets with high diagnostic accuracy. The full set of 28 items allows for a characterization of children's grammatical performance across all target forms. The 14-item screener retains acceptable diagnostic accuracy and provides clinicians with an efficient, valid measure with which to screen children for DLD.
As depicted in Figure 2, though children with DLD showed lower accuracy than their Risk and TD peers, all groups showed similar overall patterns of accuracy. All groups showed the highest accuracy on basic grammatical structures, such as simple intransitive and ditransitive verbs, and the lowest accuracy on complex structures, such as relative subject clauses. All groups also showed relatively high accuracy on language-specific structures, including classifiers and aspect. Consistent with processing-based accounts of DLD (Leonard, 2014), children with DLD can use the linguistic cues available in their ambient language, though interactions between these cues and limitations in processing capacity and speed result in lower performance. Cross-linguistically, grammatical forms that are difficult for all children are particularly challenging for children with DLD (Leonard, 2014).
Clinical Implications
The main goal of this study was to optimize tasks of sentence repetition that could be used for different purposes (full assessment and screening). There are several clinical implications of these results.
First, the discriminant scores obtained through discriminant function analysis yielded better classification accuracy (sensitivity and specificity) than raw scores. This is likely because discriminant analysis assigns each item a weight reflecting its relative contribution to the overall score. In this way, discriminant scores better capture the variable nature of language difficulties in DLD: individual children with DLD may have difficulty with different linguistic forms to different degrees. When we look only at raw score totals, this variation in ability is flattened, resulting in lower accuracy and a reduced ability to separate the clinical groups (DLD vs. TD). While we recognize that raw scores are easier to derive and more directly accessible for clinical use, the levels of sensitivity and specificity obtained at the optimal raw score cut-points (see Table 4) were unacceptable. Fortunately, the weights derived through the discriminant functions reported in this article remain static and can be used to automate discriminant score calculations. Automated discriminant score calculation could be built into a web tool (e.g., VietSLP, 2020) or a simple spreadsheet in which clinicians enter a student's raw scores and receive an immediate diagnostic result for both the screening and assessment tools.
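To make the automation idea concrete, the sketch below applies fixed weights to a child's item scores. The weights, constant, and score direction here are placeholders for illustration only, not the published discriminant function.

```r
# Hypothetical sketch of an automated discriminant score calculator.
# The weights and constant below are NOT the actual published values.
weights  <- c(0.12, 0.08, 0.15)   # placeholder weights (3 of 28 items shown)
constant <- -1.50                 # placeholder constant
cutoff   <- 2.738                 # assessment cut-point from Table 4

score_child <- function(item_scores) {
  d <- sum(item_scores * weights) + constant   # child's discriminant score
  if (d >= cutoff) "Refer for full evaluation" else "Within typical range"
}

score_child(c(2, 1, 2))  # example: error scores (0-2) for three items
```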
In addition, results of our diagnostic accuracy analysis for the screening tool indicate acceptable levels of sensitivity and specificity, which, in turn, translate to robust positive and negative predictive values and likelihood ratios. The 14-item screener has a positive predictive value of 52.27%, meaning that if we were to give the screener to a population-based sample, approximately 52% of children failing the screener would go on to be identified with DLD following further testing. Thus, a failed screening result always necessitates further testing, and approximately half of the students failing the screener will truly have DLD. The 14-item screener has a negative predictive value of 95.28%, meaning that in our population-based sample, approximately 95% of children passing the screener are TD. This is encouraging, as clinicians can be highly confident that a passed screening result indicates that a child does not require further testing.
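These predictive values can be checked by reconstructing the confusion matrix from the reported sensitivity and specificity and the group sizes (DLD n = 28, TD n = 122), assuming, as the reported values suggest, that they were computed directly from this sample:

```r
# Worked check of the screener's predictive values and likelihood ratios.
tp <- round(0.8214 * 28)    # 23 children with DLD who failed the screener
fn <- 28 - tp               #  5 missed cases
tn <- round(0.8279 * 122)   # 101 TD children who passed
fp <- 122 - tn              #  21 false alarms

tp / (tp + fp)              # PPV: 23/44   = 52.27%
tn / (tn + fn)              # NPV: 101/106 = 95.28%
(tp / 28) / (fp / 122)      # LR+ ~ 4.77
(fn / 28) / (tn / 122)      # LR- ~ 0.22
```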
Concluding Remarks
Sentence repetition is a promising tool for identifying DLD across languages. Many studies have provided initial validation of sentence repetition tasks; this study took the next step by optimizing a sentence repetition task for the Vietnamese language. By comparing various IRT models, we selected the most robust items for use in assessment and screening, with the goal of maximizing efficiency and diagnostic accuracy. These tools are recommended for use with monolingual Vietnamese children, ages 4–6 years, particularly children who speak a “standard” dialect, as data collection took place in Hanoi, the capital city in the northern region of Vietnam. Future studies can extend findings to additional Vietnamese populations, including older children and different geographic regions of Vietnam.
Importantly, more study is needed on sentence repetition in Vietnamese bilingual populations, with and without DLD. Bilinguals may perform differently on this task due to cross-linguistic influences and varying levels of Vietnamese language exposure and proficiency (for a review, see G. Pham, 2023). The intersection between language disorder and childhood bilingualism is not yet well understood for Vietnamese speakers. Indeed, not all tasks that accurately measure development and disorder in monolingual populations apply readily to bilingual populations. Further validation is required to evaluate the utility of these specific tasks, as well as sentence repetition more broadly, for Vietnamese bilinguals.
Data Availability Statement
The data set from this study is not publicly available but is available from the corresponding author for research purposes upon request.
Acknowledgments
This study was funded by the National Institutes of Health (NIH) Grant R01DC019335 (awarded to the first author). Research reported in this publication was supported by the National Institute on Deafness and Other Communication Disorders Grant R01DC019335 (awarded to Giang Pham). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Support for writing for the second author was provided by an administrative supplement to the parent Grant NIH/NIDCD R01DC018329 (awarded to Elizabeth Peña). We acknowledge our collaborating institution in Vietnam, Hanoi National College for Education, team members from the San Diego State University Bilingual Development in Context Research Lab, Katherine Rhodes, and participating children, parents, and teachers.
References
- Anastasi, A., & Urbina, S. (1997). Item analysis. In Psychological testing (7th ed., pp. 172–203). Prentice Hall.
- Archibald, L. M., & Joanisse, M. F. (2009). On the sensitivity and specificity of nonword repetition and sentence recall to language and memory impairments in children. Journal of Speech, Language, and Hearing Research, 52(4), 899–914. 10.1044/1092-4388(2009/08-0099)
- Armon-Lotem, S., de Jong, J., & Meir, N. (Eds.). (2015). Assessing multilingual children: Disentangling bilingualism from language impairment. Multilingual Matters. 10.21832/9781783093137
- Bao, X., Komesidou, R., & Hogan, T. P. (2024). A review of screeners to identify risk of developmental language disorder. American Journal of Speech-Language Pathology, 33(3), 1548–1571. 10.1044/2023_AJSLP-23-00286
- Baylor, C., Hula, W., Donovan, N. J., Doyle, P. J., Kendall, D., & Yorkston, K. (2011). An introduction to item response theory and Rasch models for speech-language pathologists. American Journal of Speech-Language Pathology, 20(3), 243–259. 10.1044/1058-0360(2011/10-0079)
- Bishop, D. V., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2016). CATALISE: A multinational and multidisciplinary Delphi consensus study. Identifying language impairments in children. PLOS ONE, 11(7), Article e0158753. 10.1371/journal.pone.0158753
- Bishop, D. V., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE-2 Consortium. (2017). Phase 2 of CATALISE: A multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology. Journal of Child Psychology and Psychiatry, 58(10), 1068–1080. 10.1111/jcpp.12721
- Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
- Christopulos, T. T., & Redmond, S. M. (2023). Factors impacting implementation of universal screening of developmental language disorder in public schools. Language, Speech, and Hearing Services in Schools, 54(4), 1080–1102. 10.1044/2023_LSHSS-22-00169
- Conti-Ramsden, G., Botting, N., & Faragher, B. (2001). Psycholinguistic markers for specific language impairment (SLI). Journal of Child Psychology and Psychiatry and Allied Disciplines, 42(6), 741–748. 10.1111/1469-7610.00770
- Dai, H., He, X., Chen, L., & Yin, C. (2022). Language impairments in children with developmental language disorder and children with high-functioning autism plus language impairment: Evidence from Chinese negative sentences. Frontiers in Psychology, 13, Article 926897. 10.3389/fpsyg.2022.926897
- Dam, Q., Pham, G., Potapova, I., & Pruitt-Lord, S. (2020). Grammatical characteristics of Vietnamese and English in developing bilingual children. American Journal of Speech-Language Pathology, 29(3), 1212–1225. 10.1044/2019_AJSLP-19-00146
- Daub, O., Skarakis-Doyle, E., Bagatto, M. P., Johnson, A. M., & Cardy, J. O. (2019). A comment on test validation: The importance of the clinical perspective. American Journal of Speech-Language Pathology, 28(1), 204–210. 10.1044/2018_AJSLP-18-0048
- DeVellis, R. F. (2017). Factor analysis. In Scale development: Theory and applications (4th ed., pp. 153–204). Sage.
- Dollaghan, C. A. (2007). The handbook for evidence-based practice in communication disorders. Brookes.
- Dollaghan, C. A., & Horner, E. A. (2011). Bilingual language assessment: A meta-analysis of diagnostic accuracy. Journal of Speech, Language, and Hearing Research, 54(4), 1077–1088. 10.1044/1092-4388(2010/10-0093)
- Ebert, K. D., Ochoa-Lubinoff, C., & Holmes, M. P. (2020). Screening school-age children for developmental language disorder in primary care. International Journal of Speech-Language Pathology, 22(2), 152–162. 10.1080/17549507.2019.1632931
- Ebert, K. D., & Pham, G. (2017). Synthesizing information from language samples and standardized tests in school-age bilingual assessment. Language, Speech, and Hearing Services in Schools, 48(1), 42–55. 10.1044/2016_LSHSS-16-0007
- Ehrler, D., & McGhee, R. (2008). Primary Test of Nonverbal Intelligence (PTONI). Pro-Ed.
- Eitel, S., Tran, H. V., & Management Systems International. (2017). Speech and language therapy assessment in Vietnam. The United States Agency for International Development (USAID): Vietnam Evaluation, Monitoring and Survey Services Project (VEMSS). https://pdf.usaid.gov/pdf_docs/PA00MJHP.pdf
- Embretson, S. E., & Reise, S. P. (2000). Polytomous IRT models. In Item response theory for psychologists (pp. 95–124). Erlbaum.
- Friendly, M., & Fox, J. (2021). candisc: Visualizing generalized canonical discriminant and canonical correlation analysis (R package Version 0.8-6) [Computer software]. https://CRAN.R-project.org/package=candisc
- Frizelle, P., O'Neill, C., & Bishop, D. V. (2017). Assessing understanding of relative clauses: A comparison of multiple-choice comprehension versus sentence repetition. Journal of Child Language, 44(6), 1435–1457. 10.1017/s0305000916000635
- Gagarina, N. V., Klop, D., Kunnari, S., Tantele, K., Välimaa, T., Balčiūnienė, I., Bohnacker, U., & Walters, J. (2019). MAIN: Multilingual Assessment Instrument for Narratives. ZAS Papers in Linguistics, 56, Article 155. 10.21248/zaspil.56.2019.414
- Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage.
- Leonard, L. B. (2014). Children with specific language impairment. MIT Press.
- Lopez-Raton, M., Rodriguez-Alvarez, M. X., Suarez, C. C., & Sampedro, F. G. (2014). OptimalCutpoints: An R package for selecting optimal cutpoints in diagnostic tests. Journal of Statistical Software, 61(8), 1–36. 10.18637/jss.v061.i08
- Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
- Marinis, T., & Armon-Lotem, S. (2015). Sentence repetition. In S. Armon-Lotem, J. de Jong, & N. Meir (Eds.), Assessing multilingual children: Disentangling bilingualism from language impairment (pp. 95–122). Multilingual Matters. 10.21832/9781783093137-007
- Miller, J., Andriacchi, K., & Nockerts, A. (2019). Assessing language production using SALT software. SALT Software.
- Pawłowska, M. (2014). Evaluation of three proposed markers for language impairment in English: A meta-analysis of diagnostic accuracy studies. Journal of Speech, Language, and Hearing Research, 57(6), 2261–2273. 10.1044/2014_jslhr-l-13-0189
- Peña, E. D., Gutierrez-Clellen, V., Iglesias, A., Goldstein, B., & Bedore, L. M. (2014). BESA: Bilingual English-Spanish Assessment Manual. AR-Clinical.
- Pham, B., & McLeod, S. (2016). Consonants, vowels and tones across Vietnamese dialects. International Journal of Speech-Language Pathology, 18(2), 122–134. 10.3109/17549507.2015.1101162
- Pham, G. (2023). A narrative approach to synthesizing research on Vietnamese bilingual and monolingual children. Journal of Speech, Language, and Hearing Research, 66(12), 4756–4770. 10.1044/2023_JSLHR-23-00047
- Pham, G., & Ebert, K. D. (2020). Diagnostic accuracy of sentence repetition and nonword repetition for developmental language disorder in Vietnamese. Journal of Speech, Language, and Hearing Research, 63(5), 1521–1536. 10.1044/2020_JSLHR-19-00366
- Pham, G., Pruitt-Lord, S., Snow, C. E., Nguyen, Y. H. T., Phạm, B., Dao, T. B. T., Tran, N. B. T., Pham, L. T., Hoang, H. T., & Dam, Q. D. (2019). Identifying developmental language disorder in Vietnamese children. Journal of Speech, Language, and Hearing Research, 62(5), 1452–1467. 10.1044/2019_JSLHR-L-18-0305
- Pham, G., & Snow, C. E. (2021). Beginning to read in Vietnamese: Kindergarten precursors to first grade fluency and reading comprehension. Reading and Writing, 34(1), 139–169. 10.1007/s11145-020-10066-w
- Plante, E., & Vance, R. (1994). Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25(1), 15–24. 10.1044/0161-1461.2501.15
- Polišenská, K., Chiat, S., & Roy, P. (2015). Sentence repetition: What does the task measure? International Journal of Language & Communication Disorders, 50(1), 106–118. 10.1111/1460-6984.12126
- R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
- Redmond, S. M., Ash, A. C., Christopulos, T. T., & Pfaff, T. (2019). Diagnostic accuracy of sentence recall and past tense measures for identifying children's language impairments. Journal of Speech, Language, and Hearing Research, 62(7), 2438–2454. 10.1044/2019_jslhr-l-18-0388
- Riches, N. G. (2012). Sentence repetition in children with specific language impairment: An investigation of underlying mechanisms. International Journal of Language & Communication Disorders, 47(5), 499–510. 10.1111/j.1460-6984.2012.00158.x
- Rujas, I., Mariscal, S., Murillo, E., & Lázaro, M. (2021). Sentence repetition tasks to detect and prevent language difficulties: A scoping review. Children, 8(7), Article 578. 10.3390/children8070578
- Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 34(S1), 1–97. 10.1007/BF03372160
- Samejima, F. (1996). The graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory. Springer.
- Schneider, P., Dubé, R. V., & Hayward, D. (2005). The Edmonton Narrative Norms Instrument.
- Smith, G. T., & McCarthy, D. M. (1995). Methodological considerations in the refinement of clinical assessment instruments. Psychological Assessment, 7(3), 300–308. 10.1037/1040-3590.7.3.300
- Taha, J., Stojanovik, V., & Pagnamenta, E. (2021). Sentence repetition as a clinical marker of developmental language disorder: Evidence from Arabic. Journal of Speech, Language, and Hearing Research, 64(12), 4876–4899. 10.1044/2021_JSLHR-21-00244
- Tomblin, J. B., Records, N. L., & Zhang, X. (1996). A system for the diagnosis of specific language impairment in kindergarten children. Journal of Speech, Language, and Hearing Research, 39(6), 1284–1294. 10.1044/jshr.3906.1284
- Trinh, T., Pham, G., Phạm, B., Hoang, H., & Pham, L. (2020). The adaptation of MAIN to Vietnamese. ZAS Papers in Linguistics, 64, 263–268. 10.21248/zaspil.64.2020.581
- Tucci, A., Plante, E., Vance, R., & Oglivie, T. (2019). Data-driven item selection for the Shirts and Shoes Test. Journal of Communication Disorders, 78, 46–56. 10.1016/j.jcomdis.2019.01.002
- Vietnam National Assembly. (2005). Luật giáo dục [The education law of Vietnam]. Vietnam Ministry of Justice. http://www.moj.gov.vn/vbpq/en/Lists/Vn%20bn%20php%20lut/View_Detail.aspx?ItemID=5484
- VietSLP. (2020). VietSLP: Speech-language pathology resources for supporting Vietnamese children. https://vietslp.sdsu.edu/
- Wang, D., Zheng, L., Lin, Y., Zhang, Y., & Sheng, L. (2022). Sentence repetition as a clinical marker for Mandarin-speaking preschoolers with developmental language disorder. Journal of Speech, Language, and Hearing Research, 65(4), 1543–1560. 10.1044/2021_JSLHR-21-00401