Author manuscript; available in PMC 2020 Nov 25. Published in final edited form as: Text Speech Dialog. 2016 Sep 3;9924:470–477. doi: 10.1007/978-3-319-45510-5_54

Automatic scoring of a Sentence Repetition Task from Voice Recordings

Meysam Asgari 1, Allison Sliter 1, Jan Van Santen 1

Abstract

In this paper, we propose an automatic scoring approach for assessing language deficits in a sentence repetition task used to evaluate children with language disorders. From ASR-transcribed sentences, we extract sentence similarity measures, including word error rate (WER) and Levenshtein distance, and use them as input features in a regression model to predict reference scores manually rated by experts. Our experimental analysis on subject-level scores of 46 children, 33 diagnosed with autism spectrum disorder (ASD) and 13 with specific language impairment (SLI), shows that the proposed approach successfully predicts scores, with an average product-moment correlation of 0.84 between observed and predicted ratings across test folds.

Keywords: Automatic language assessment, autism spectrum disorders, language impairment

1. Introduction

Language disorders (LD) in childhood are associated with social stress and increased difficulties with peer relations in adolescence [1], as well as with risk for poor social adaptation and psychiatric disorders in adulthood [2]. Children with LD have communication problems that negatively impact long-term cognitive, academic, and psychosocial development [3]. There is a significant need for language assessment for early detection, diagnosis, screening, and progress tracking of language difficulties. However, assessment requires face-to-face sessions with a professional, which may not always be available or affordable. Clearly, there is a need for computer-based systems for automated speech-based language assessment.

Researchers have investigated several forms of automated speech assessment. Automated assessment of speech intelligibility has been conducted for children after cleft palate repair surgery and for adults after laryngectomy [4]. Some effort has been made toward automatic assessment of disordered speech by using intermediate phonological models and adapting automatic speech recognition (ASR) models to the main pathology, improving agreement between computed and expert-rated intelligibility [5]. Automated prosody assessment has also been conducted in studies of children with autism spectrum disorder (ASD) [6]; that work showed that different forms of prosodic stress can be detected automatically.

After describing the children's speech corpus in Section 2, we formulate an approach to language assessment for the sentence repetition task by creating automatic scoring methods that optimally predict gold-standard human ratings. We use automatic speech recognition (ASR) to transcribe the children's spoken responses and then predict item scores from the ASR output with a regression algorithm. In our corpus, items are scored on a 0–3 scale. Subject-level scores are then predicted from standard summary statistics derived from the estimated sentence-level item scores, as described in Section 3. The machine learning experiments and results are reported and discussed in Section 4.

2. Corpus

The corpus consists of audio recordings of the "Recalling Sentences" portion of the CELF-4, together with time-aligned transcriptions of the study subjects' responses. The study group includes 46 children, ages 6 to 9 years (mean age 7 years 3 months): 18 diagnosed with an autism spectrum disorder (ASD) as well as a language deficit, 15 with ASD but no language deficit, and 13 with specific language impairment (SLI) but no ASD. The study group included 36 males and 10 females. ASD can present with or without language impairment, and teasing these apart often requires a comprehensive clinical language evaluation. The Clinical Evaluation of Language Fundamentals, fourth edition (CELF-4) does precisely this. However, it comprises 18 subtests, all of which are currently scored on paper by a speech-language pathologist [7].

2.1. Task

Each member of the study group was evaluated using the Clinical Evaluation of Language Fundamentals edition 4 (CELF-4) [8]. CELF-4 is an individually administered test, designed by The Psychological Corporation and used by speech-language pathologists to determine whether a student (ages 5–21 years) has a language disorder or delay. The specific task that was recorded and provided the speech data for this experiment is the Recalling Sentences task, in which an examiner recites increasingly long and syntactically complex sentences (32 sentences in English), the prompts, that the child is asked to repeat verbatim. This task contributes to a core language score, an expressive language score, and a language structure score. Scores range from zero to three: a single error (a word omission, repetition, addition, transposition, or substitution) results in a one-point reduction, two to three errors result in a score of 1, and four or more errors earn a zero. We note that a word transposition is scored as one error unless it changes the meaning, such as a subject-object reversal, in which case two errors are counted. As in other tests, items increase in difficulty, and to minimize frustration a stopping rule is used: the examination is discontinued if the child receives five consecutive scores of zero. Scores were generated by a certified expert and independently verified for reliability.
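As a concrete illustration, the mapping from error counts to item scores can be written as a small function. The sketch below (Python, with an illustrative function name, not from the CELF-4 materials) assumes all counted errors are weighted equally and that meaning-changing transpositions have already been counted as two errors.

```python
def item_score(num_errors: int) -> int:
    """Map the number of counted repetition errors to a 0-3 item score."""
    if num_errors == 0:
        return 3          # verbatim repetition
    if num_errors == 1:
        return 2          # a single error
    if num_errors <= 3:
        return 1          # two or three errors
    return 0              # four or more errors
```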

2.2. Manual Scoring

The CELF examinations were conducted by a speech-language pathologist and audio recorded. Those recordings were then transcribed with time alignment by linguists. Neologisms were transcribed phonetically, and regular, developmentally appropriate phoneme replacements were disregarded. Using the transcripts' time alignment, the portions of the audio in which the child was responding to the prompt were selected and paired with their transcriptions. Table 1 shows the distribution of manually rated scores across all 1083 sentences in the corpus.

Table 1.

The frequency of scores across all recordings

Score            0     1     2     3
# of sentences   380   175   158   370

3. Method

3.1. Automatic scoring of a Sentence Repetition Task

Our proposed method consists of two main components, an ASR system and a machine-learning-based scoring algorithm, described in the following. First, we employ an ASR system to automatically transcribe the repeated sentences. Typically, an ASR system is used to produce the single highest-likelihood transcription of the child's spoken response; however, the most accurate transcript is often not the one receiving the highest likelihood. Therefore, instead of relying on the highest-likelihood transcription, we generate, via ASR, the 10 highest-likelihood transcriptions of each response and compute the Levenshtein distance of each transcription to the stimulus sentence. For example, the Levenshtein distance of the response "the boy ate cookie" to the stimulus sentence "the boy dropped the cookie" is 2, because there is one substitution (dropped to ate) and one deletion (the). This gives us the transcription with the lowest word error rate (WER), often known as the oracle.
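A minimal sketch of this oracle selection step (not the authors' implementation; function names are illustrative) is shown below: word-level substitution, deletion, and insertion counts are computed by dynamic programming, and the n-best hypothesis closest to the stimulus is kept.

```python
def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) between two word lists."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total cost, subs, dels, ins) aligning ref[:i] with hyp[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = (i, 0, i, 0)                      # i deletions against empty hypothesis
    for j in range(H + 1):
        dp[0][j] = (j, 0, 0, j)                      # j insertions against empty reference
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d, u, l = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
            dp[i][j] = min(
                (d[0] + sub, d[1] + sub, d[2], d[3]),   # match / substitution
                (u[0] + 1, u[1], u[2] + 1, u[3]),       # deletion
                (l[0] + 1, l[1], l[2], l[3] + 1),       # insertion
            )
    _, subs, dels, ins = dp[R][H]
    return subs, dels, ins

def oracle_hypothesis(stimulus, nbest):
    """Pick the n-best hypothesis with the smallest Levenshtein distance to the stimulus."""
    ref = stimulus.split()
    return min(nbest, key=lambda hyp: sum(edit_ops(ref, hyp.split())))

# Example from the text: one substitution (dropped -> ate) and one deletion (the)
print(edit_ops("the boy dropped the cookie".split(), "the boy ate cookie".split()))
```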

Next, we encode the transcription with the lowest WER, the oracle, using the following features: the numbers of insertions, deletions, and substitutions; the WER; and the Levenshtein distance itself. This results in a five-dimensional per-sentence feature vector. For a given subject, we then compute subject-level features by applying standard summary statistics, including mean, median, standard deviation, and entropy, to the per-sentence feature vectors derived from the first N (12 in our experiments) items presented to that subject. We also capture interactions between per-sentence features by computing the covariance matrix (upper-triangular elements) of the features. This generates a global feature vector of fixed dimension for each subject. Finally, we use these features to predict the total score (here defined as the average item score over the first N items), using support vector regression (SVR) models.
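A sketch of this subject-level feature construction is given below, assuming the five per-sentence features have already been extracted; the helper names (subject_features, per_child_item_features, average_item_scores) are illustrative and not from the paper.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.svm import SVR

def subject_features(item_feats, n_items=12):
    """item_feats: (num_items, 5) array of [ins, del, sub, WER, Levenshtein]."""
    X = np.asarray(item_feats[:n_items], dtype=float)
    stats = np.concatenate([
        X.mean(axis=0),
        np.median(X, axis=0),
        X.std(axis=0),
        # entropy of each feature's (normalized) value distribution over items
        [entropy(col + 1e-8) for col in X.T],
    ])                                              # 4 stats x 5 features = 20 values
    cov = np.cov(X, rowvar=False)                   # 5 x 5 covariance matrix
    upper = cov[np.triu_indices(5)]                 # 15 upper-triangular entries
    return np.concatenate([stats, upper])           # 35-dimensional subject vector

# Hypothetical usage: one feature vector and one average item score per child
# X_subjects = np.stack([subject_features(f) for f in per_child_item_features])
# y_scores = np.array(average_item_scores)
# model = SVR(kernel="linear").fit(X_subjects, y_scores)
```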

3.2. Automatic Speech Recognition

Learning acoustic models for ASR systems requires a fairly large amount of training data, which is usually beyond the scope of data collection for specialized populations. We address this issue by adding a large children's speech database to our small corpus for learning acoustic models. For automatic transcription of the recordings, we built a context-dependent HMM-GMM system with 39-dimensional MFCC features with delta and delta-delta coefficients, using the state-of-the-art Kaldi speech recognition toolkit [9]. We used the OGI Kids' Speech Corpus, consisting of 27 hours of spontaneous speech from 1100 children from kindergarten through grade 10 [10], for training the acoustic models. After cepstral mean and variance normalization and LDA, we employed model-space adaptation using maximum likelihood linear regression (MLLR). Speaker adaptive training (SAT) of the acoustic models was also performed, using both vocal tract length normalization (VTLN) and feature-space adaptation with feature-space MLLR (fMLLR). A trigram language model was built on the OGI Kids' Speech Corpus using the SRILM toolkit [11]. The WER on a 2-hour held-out test corpus was approximately 26%; selecting the oracle hypothesis from the 10-best recognition output reduced the WER to 14.27%.

4. Experiments

As mentioned above, the corpus includes children ages 6–9 years, 36 males and 10 females. We evaluated the performance of our proposed method on two tasks: 1) a four-class classification task to classify sentence-level ratings, and 2) a regression task to predict subject-level scores. The manual scoring describes how precisely the prompt has been repeated by the participant and provides a reference for the automatic scoring. From the transcriptions generated by the ASR system described above, we extracted the numbers of insertions, deletions, and substitutions, the WER, and the Levenshtein distance from each pair of prompt and its associated automatic transcription, constructing a five-dimensional feature vector for each recording. For comparison, we performed the same analysis over the manual transcripts and extracted the same set of features from them. We also excluded sentences for which the raters took semantic changes into account when scoring the recordings.

4.1. Multi-class classification

We evaluated the performance of the proposed method on learning a 4-way classification model to predict the sentence-level scores, employing support vector machine (SVM) classifiers with radial basis function (RBF) and linear kernels as implemented in the scikit-learn toolkit [12]. Manually rated scores vary from zero to three according to the number of errors (a word omission, repetition, addition, transposition, or substitution) in the repeated sentence: class labels of 3, 2, 1, and 0 correspond to zero, one, two or three, and four or more errors, respectively. To estimate the optimal set of model parameters, we used a five-fold cross-validation scheme, setting all model parameters using four of the five sets as the training set and using the fifth set only at test time. Parameters of the optimal SVM model were estimated on the training set separately for each fold, via grid search and cross-validation. As shown in Table 1, the class distribution is not balanced, and thus classification accuracy may not describe the performance of the classifier well. To address this drawback, we adopt unweighted average recall (UAR) as the performance criterion, which normalizes the effect of skewed classes. Table 2 reports the performance of the different classifiers, measured in terms of UAR, for classifying sentence ratings into four classes. The results show that our proposed method, applied to all three forms of transcription, significantly outperforms the chance model in terms of UAR. The results also suggest that the SVM model with a linear kernel is more suitable for this task than the non-linear RBF kernel. Furthermore, the comparable UAR between ASR-Oracle and Manual suggests that an ASR system can be employed effectively for this task.

Table 2.

Unweighted Average Recall

Model        Chance   ASR    ASR-Oracle   Manual
SVM-linear   0.24     0.55   0.59         0.61
SVM-RBF      0.24     0.54   0.57         0.58
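The sketch below illustrates one way this classification experiment could be set up in scikit-learn (not necessarily the authors' exact configuration): SVM hyperparameters are grid-searched within each training fold, and UAR is computed as macro-averaged recall over the held-out predictions. The parameter grid and function name are assumptions.

```python
from sklearn.model_selection import GridSearchCV, cross_val_predict, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import recall_score, make_scorer

# UAR corresponds to macro-averaged recall in scikit-learn terminology
uar = make_scorer(recall_score, average="macro")

def evaluate_sentence_classifier(X, y, kernel="linear"):
    """X: (num_sentences, 5) edit-distance features; y: 0-3 sentence scores."""
    grid = {"C": [0.1, 1, 10]}
    if kernel == "rbf":
        grid["gamma"] = ["scale", 0.01, 0.1]
    clf = GridSearchCV(SVC(kernel=kernel), grid, scoring=uar, cv=5)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    y_pred = cross_val_predict(clf, X, y, cv=outer)
    return recall_score(y, y_pred, average="macro")   # UAR over held-out folds
```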

We also report the detailed performance of our 4-way classification system (linear SVM) in terms of precision, recall, and F1-score in Tables 3, 4, and 5 for predicting the sentence-level true scores using ASR, ASR-Oracle, and manual transcriptions, respectively.

Table 3.

ASR

class label   precision   recall   F1-score
0             0.77        0.85     0.80
1             0.36        0.26     0.30
2             0.29        0.17     0.20
3             0.70        0.87     0.77

Table 4.

ASR-Oracle

class label   precision   recall   F1-score
0             0.81        0.85     0.83
1             0.32        0.35     0.38
2             0.29        0.18     0.21
3             0.76        0.88     0.81

Table 5.

Manual transcription

class label   precision   recall   F1-score
0             0.84        0.91     0.87
1             0.51        0.39     0.43
2             0.34        0.30     0.30
3             0.77        0.86     0.80

4.2. Regression

Subject-level scores

The distribution of manual ratings plotted in Figure 1 shows that sentence difficulty varies across sentences, with less variation for the shorter sentences (items 1 to 12). In other words, the number of errors in the repeated sentences is directly proportional to the length of the prompt. To assess subject-level performance on the repetition task, we took the first 12 sentences presented to the subject and averaged their sentence-level ratings. This gives a single per-subject metric describing the overall performance of the subject. Through a regression model, we aim to predict this score for every subject in our corpus.

Fig. 1. Distribution of manually rated scores as a function of the number of words in the sentence.

Features

Per-subject features are computed across all N = 12 sentences by applying standard summary statistics, including mean, median, standard deviation, and entropy, to the 5-dimensional per-sentence feature vectors as described earlier (4 statistics × 5 features = 20 values). We also capture interactions between per-sentence features by computing the covariance matrix of the features (15 upper-triangular elements). This results in a global 35-dimensional feature vector for each subject.

Learning strategies

We investigated two forms of regularization: the L2 norm in ridge regression and the hinge loss function in support vector regression. These two learning strategies were evaluated on our data set using a five-fold cross-validation scheme with the scikit-learn toolkit [12]. The averaged mean absolute error (MAE) between predicted and gold-standard scores across the five test folds for the two learning strategies is presented in Table 6. Table 7 shows the performance of the learning strategies in predicting the subject-level scores in terms of the averaged product-moment correlations between observed and predicted ratings across test folds. The results show that linear SVR outperforms ridge regression. Also, the only slightly higher MAE using ASR-Oracle compared to manual transcription suggests that the proposed automatic method using an ASR system can be effectively employed for assessing the sentence repetition task.

Table 6.

Mean Absolute Error (MAE) between observed and predicted ratings. K is the number of features.

Model        Manual   ASR-Oracle   ASR    K
Ridge        0.10     0.11         0.21   35
Linear SVR   0.035    0.10         0.13   35

Table 7.

Average product-moment correlations between observed and predicted ratings. K is the number of features.

Model        Manual   ASR-Oracle   ASR    K
Ridge        0.93     0.83         0.73   35
Linear SVR   0.96     0.84         0.75   35
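As an illustration of the evaluation loop described above, the sketch below computes fold-averaged MAE and Pearson correlation for ridge regression and linear SVR. It assumes the 35-dimensional subject feature matrix and average item scores are already available; model settings and variable names are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

def evaluate_regressor(model, X, y, n_splits=5, seed=0):
    """Return (mean MAE, mean Pearson r) across the test folds."""
    maes, corrs = [], []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        pred = model.fit(X[train], y[train]).predict(X[test])
        maes.append(mean_absolute_error(y[test], pred))
        corrs.append(pearsonr(y[test], pred)[0])
    return np.mean(maes), np.mean(corrs)

# Hypothetical usage with per-subject features X_subjects and scores y_scores:
# for name, model in [("Ridge", Ridge(alpha=1.0)), ("Linear SVR", SVR(kernel="linear"))]:
#     print(name, evaluate_regressor(model, X_subjects, y_scores))
```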

5. Conclusions

In this paper, we showed that a sentence repetition task can be automatically scored, and we described an automatic scoring system. The scoring system applies ASR to the spoken responses, optionally computes estimated item scores via machine learning (ML) from the ASR output, and computes estimated total scores by applying ML either to the estimated item scores or directly to the ASR output. Confining the analysis to the 41 children (12 with ASD but no language impairment (ALN), 18 with ASD and language impairment (ALI), and 11 with SLI) who had been given at least the first N = 12 items, and computing each child's average score on these items, we found that the mean absolute error between observed and estimated total scores was 0.10 (observed score range 0.83–3.0); the product-moment correlation was 0.84.

6. Acknowledgements

We thank Katina Papadakis for manually transcribing the corpus for this project. This research was supported by NIH award 1R01DC013996-01A1. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not reflect the views of the funding agencies.

References

  • 1. Conti-Ramsden Gina, Mok Pearl LH, Pickles Andrew, and Durkin Kevin, "Adolescents with a history of specific language impairment (SLI): strengths and difficulties in social, emotional and behavioral functioning," Research in Developmental Disabilities, vol. 34, no. 11, pp. 4161–4169, 2013.
  • 2. Clegg Judy, Hollis Chris, Mawhood Lynn, and Rutter Michael, "Developmental language disorders – a follow-up in later adult life. Cognitive, language and psychosocial outcomes," Journal of Child Psychology and Psychiatry, vol. 46, no. 2, pp. 128–149, 2005.
  • 3. Beitchman Joseph H., Language, Learning, and Behavior Disorders: Developmental, Biological, and Clinical Perspectives, Cambridge University Press, 1996.
  • 4. Maier Andreas, Haderlein Tino, Eysholdt Ulrich, Rosanowski Frank, Batliner Anton, Schuster Maria, and Nöth Elmar, "PEAKS – a system for the automatic evaluation of voice and speech disorders," Speech Communication, vol. 51, no. 5, pp. 425–437, 2009.
  • 5. Middag Catherine, Martens Jean-Pierre, Van Nuffelen Gwen, and De Bodt Marc, "Automated intelligibility assessment of pathological speech using phonological features," EURASIP Journal on Advances in Signal Processing, vol. 2009, no. 1, pp. 1–9, 2009.
  • 6. Van Santen Jan PH, Prudhommeaux Emily Tucker, and Black Lois M, "Automated assessment of prosody production," Speech Communication, vol. 51, no. 11, pp. 1082–1097, 2009.
  • 7. Paslawski Teresa, "The Clinical Evaluation of Language Fundamentals, fourth edition (CELF-4): a review," Canadian Journal of School Psychology, vol. 20, no. 1/2, pp. 129, 2005.
  • 8. Semel Eleanor Messing, Wiig Elisabeth H, and Secord Wayne, CELF-4: Clinical Evaluation of Language Fundamentals, Pearson: Psychological Corporation, 2006.
  • 9. Povey Daniel, Ghoshal Arnab, Boulianne Gilles, Burget Lukas, Glembek Ondrej, Goel Nagendra, Hannemann Mirko, Motlicek Petr, Qian Yanmin, Schwarz Petr, Silovsky Jan, Stemmer Georg, and Vesely Karel, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, Dec. 2011.
  • 10. Shobaki Khaldoun, Hosom John-Paul, and Cole Ronald, "The OGI Kids' Speech Corpus and recognizers," in Proc. of ICSLP, 2000, pp. 564–567.
  • 11. Stolcke Andreas, "SRILM – an extensible language modeling toolkit," in Proc. of ICSLP, 2002, pp. 901–904.
  • 12. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, and Duchesnay E, "Scikit-learn: machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
