Abstract
Auditory perceptual analysis (APA) is the main method for clinical assessment of speech-language deficits, which are among the most prevalent childhood disabilities. However, APA results are susceptible to intra- and inter-rater variability, and manual or hand-transcription-based diagnostic methods suffer from further limitations, such as high demands on time and expertise. To address these limitations, there is growing interest in automated methods that quantify children's speech patterns for diagnosing speech disorders. Landmark (LM) analysis is one such approach: it characterizes acoustic events that occur as the result of sufficiently precise articulatory movements. This work investigates the use of LMs for automatic speech disorder detection in children. In addition to the LM-based features proposed in existing research, we propose a set of novel knowledge-based features. A systematic study and comparison of different linear and nonlinear machine learning classification techniques, based on both the raw features and the proposed features, is conducted to assess the effectiveness of the novel features in distinguishing children with speech disorders from typically developing speakers.
1 Introduction
Speech-language deficits are one of the most prevalent childhood disabilities, affecting about 1 in 12 children between three and five years old [1]. Despite the recognition that early identification and treatment of communication disorders is important for school readiness and has been shown to significantly improve communication, literacy, and mental health outcomes for young children [1–3], approximately 40% of children with speech and language disorders do not receive intervention because their impairment goes undetected [4,5]. Auditory perceptual analysis (APA) is the main method for clinical assessment of disordered speech; however, results from APA are susceptible to intra- and inter-rater variability [6]. Another factor to consider is that some children may be reluctant to participate in long testing sessions [7], and even when they do, transcription of large sets of audio recordings is time-consuming and requires a high level of expertise from therapists [8,9]. These limitations of manual or hand-transcription-based diagnostic assessment methods have led to an increasing need for automated methods that quickly and consistently quantify child speech patterns and help diagnose children with impaired speech [10]. Landmark (LM) analysis is one such approach; it characterizes speech with acoustic markers developed based on the LM theory of speech perception [11–13]. Unlike automatic speech recognition (ASR), LM analysis does not attempt to identify words, but rather to detect acoustic events that occur as the result of sufficiently precise articulatory movements. LM analysis has been suggested as the basis for automatic speech analysis [6]. Therefore, in this work we focus on the utilization of LMs for automatic speech disorder detection in children, with LMs extracted using a publicly available software package, the SpeechMark toolbox [6]. SpeechMark not only analyzes physical aspects of the signal but also applies acoustic knowledge of articulatory features in the analysis process [6]. As a result, it has been utilized in numerous studies to extract LMs for various applications such as the detection of stress [14], depression [15], emotion [16], and sleep deprivation [17]. Here we briefly review how it works. SpeechMark divides the computed spectrogram into six frequency bands; fine and coarse processing steps are then conducted to detect band-energy rises and to determine thresholds for peak detection. Finally, energy peaks are located and LM types are determined based on the patterns of changes in the frequency bands [12,13]. The description of each landmark detected by this tool and used in this study is presented in Table 1. An example of LMs detected by SpeechMark from the speech of a speaker uttering a word is shown in Fig 1.
Table 1. Description of landmarks used in this study.
| Landmark | Description |
|---|---|
| g (glottis) | Onset (+) and offset (-) of sustained motion of the vocal folds |
| b (burst) | Onset (+) and offset (-) of frication or bursts in an unvoiced segment |
| s (syllabicity) | Release (+) and closure (-) of sonorant consonant in a voiced segment |
| f (unvoiced frication) | Onset (+) and offset (-) of frication in an unvoiced segment |
| v (voiced frication) | Onset (+) and offset (-) of frication in a voiced segment |
Fig 1. Detected landmarks (LMs) using SpeechMark.
An extension of the SpeechMark landmark system, namely automatic syllabic cluster analysis, was recently proposed; it groups LMs into syllabic units [18]. These LMs are grouped based on specific rules, such as the requirement that a syllabic cluster (SC) contain a voiced segment of at least 30 ms [19]. Studies show that SC patterns are good indicators of differences between normal and disordered speakers [18,19]. Variations in articulatory exactness in normal and disordered speakers have been shown to be related to LM and SC patterns [19]. This is supported by existing studies showing that simple counts of LMs and/or SCs can be used for classification of gender [6], Parkinson’s disease [20], and sleep deprivation [17]. Counting individual LMs, a.k.a. unigrams, does not consider the specific order or sequence of the LMs, which may contain important information about the speech. An n-gram, a generalization of the unigram defined as a sequence of n consecutive LMs, takes the specific LM order into consideration when n ≥ 2 [21]. It has been found that n-gram counts (n = 1, 2, 3, 4) are good features for depression detection [15,22]. For example, in [22], SpeechMark was used to extract landmarks from a large dataset of smartphone recordings. Two sets of features were proposed based on speech landmark bigrams, i.e., bigram-count and LDA-bigram. The first set calculates the frequencies of bigrams, and the second detects latent patterns from bigrams using natural language text processing. A linear support vector machine (SVM) classifier was trained using the two sets of features. It was found that the bigram features increased the accuracy of the SVM classifier from 72.9% when only acoustic features were used to 78.7% when either bigram-count or LDA-bigram features were utilized. The speech landmark bigram features improved the F1(depressed) by 30.1% compared to acoustic features [22]. Besides n-gram counts, time-based LM features have also been proposed in the literature. These include durations of the bigrams (i.e., 2-grams) and LM pairs (i.e., onset and offset of a LM as defined in Table 1) [15], and speech rate, defined as the number of phonetic units, such as syllables or words, uttered per unit time [23,24].
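To make the n-gram construction concrete, the following minimal Python sketch (not the implementation used in the cited studies) counts unigrams, bigrams, and trigrams from a detected landmark sequence; the example sequence is hypothetical.

```python
from collections import Counter

def ngram_counts(landmarks, n):
    """Count n-grams (sequences of n consecutive landmarks) in a landmark sequence."""
    return Counter(tuple(landmarks[i:i + n]) for i in range(len(landmarks) - n + 1))

# Hypothetical landmark sequence detected from one utterance (e.g., by SpeechMark).
lm_sequence = ["+g", "+s", "-s", "+s", "-s", "-g", "+b", "-b"]

unigrams = ngram_counts(lm_sequence, 1)   # e.g., unigrams[("+s",)] == 2
bigrams = ngram_counts(lm_sequence, 2)    # order matters: ("+s", "-s") != ("-s", "+s")
trigrams = ngram_counts(lm_sequence, 3)
```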
In this work, we adopt counts of n-grams as well as duration and rate features based on LMs and n-grams. One contribution of this work is to propose novel knowledge-based features and to demonstrate their effectiveness in detecting childhood speech disorder. For example, this work studies features defined as the ratios of n-gram counts (n ≥ 2) to unigram counts. The rationale for using ratios is similar to that of the body-mass index (BMI): weight alone cannot determine whether a person is overweight, so BMI normalizes weight by height. Ratios are usually better features than the absolute individual values for addressing individual variations among samples within the same class, a point further validated in this study. Another contribution is to perform systematic feature selection to identify key features and quantify their contributions to the classification of patients with speech disorder versus normal controls. The final contribution of this work is a systematic study and comparison of different linear and nonlinear machine learning classification techniques and their effectiveness in classifying speech disorder patients from normal speakers.
The remainder of this work is organized as follows. Section 2 describes the materials used in this study, including general information about the speakers from whom the speech samples were collected, the conditions and procedures under which the speech samples were recorded and processed to obtain the dataset, and the features proposed and studied in this work. Section 2 also introduces the analytical methods used in this study, including methods to address data imbalance, the machine learning (ML) classification techniques employed, and the procedure and criteria used to evaluate the performance of different ML techniques in screening children with speech disorders. Section 3 presents the results and discussion, and Section 4 draws conclusions.
2 Materials and methods
2.1 Ethical considerations
Ethical approval for human subject data collection was granted by the University of Cincinnati for this study (reference number 2015–3023) and subsequently by Auburn University (reference number 17–203 EP 1705) for ongoing data analysis and additional data collection. Permissions were also sought from the schools and university clinics where data were collected. Anonymity and confidentiality were explained to participants, who were assured that withdrawal from the study would not harm them in any way. Informed consent forms were completed and signed by parents, with verbal assent from the child participants. Participants’ data were de-identified with codes to ensure anonymity. All data analyses in this work are conducted using the de-identified data.
2.2 Speakers
The speech of 52 children aged 33–94 months (mean 51.52, standard deviation 10.16) was retrieved from the Speech Evaluation and Exemplars Database (SEED) [25]. Due to missing values, one sample was dropped from this work. Of the 51 remaining children, 39 were typically developing without speech or language disorder, and 12 were diagnosed with speech sound disorder without language impairment. All children were required to demonstrate normal hearing using the criterion of sound detection at 20 dB HL for pure tones at 500, 1000, 2000, and 4000 Hz. Participants were required to exhibit age-appropriate receptive language skills on the CELF Preschool-2 [26]. Age-appropriate performance was defined as scores falling within one standard deviation of the mean (standard score > 85). Children were classified as typically developing or with speech disorder using the Clinical Assessment of Articulation and Phonology-2 or the Diagnostic Evaluation of Articulation and Phonology [27]. Children with standard scores ≤ 85 (one standard deviation below the mean) were assigned to the speech disorder group. Children with concomitant language disorders were not included in the study.
2.3 Dataset
The speech samples retrieved were recorded in local community early education centers or in the lab. Sound levels were measured prior to each recording session to determine whether the environmental noise level was below 40 dBA SPL in both the school and lab environments (Williams, Zhou, Stewart, & Knott, 2016). Speech samples were recorded at a 44 kHz sampling rate and 24-bit depth using a handheld ZOOM H6N recorder (Zoom North America) with cardioid XLR MOVO LV402 microphones (MOVO). The speech samples retrieved for this study consisted of one word from the Triage 10 word set of Anderson and Cohen [28]: flower. Acoustic landmarks, including +/-g, +/-b, +/-s, +/-f, and +/-v, as well as syllabic clusters, were obtained using the SpeechMark MATLAB toolbox (STAR Corp., MA).
2.4 Feature engineering
The raw features extracted from the audio recordings using the SpeechMark toolbox include the time stamp and strength of each LM listed in Table 1, plus the SC count. As discussed in Sec. 1, in this work we adopt all LM- and SC-based features proposed in the literature, including n-gram counts and duration and rate features based on LMs and n-grams. These features are listed in the top rows of Table 2. In addition, we explore LM strength-based features and propose n-gram ratio-based features to better address within-class variations, as discussed in Sec. 1 (a construction sketch is given after Table 2). These new features are listed in the bottom rows of Table 2. After removing illegitimate or trivial features (e.g., n-gram counts that are all zeros, or ratios with a denominator of zero), 303 unique features are generated based on the criteria listed in Table 2.
Table 2. Features employed in this study–including features adopted from literature and features proposed in this work.
| Feature category | Description | Unit |
|---|---|---|
| Features adopted from literature | | |
| Unigram count | Number of each unigram type | # |
| Bigram count | Number of each bigram type | # |
| Trigram count | Number of each trigram type | # |
| Average bigram duration | Average duration of all bigrams | s |
| Average trigram duration | Average duration of all trigrams | s |
| Duration of LM pair | Average duration of LM pairs (i.e., onset-offset) of each LM type | s |
| Unigram rate | Count of all unigram types per unit time | #/s |
| Bigram rate | Count of all bigram types per unit time | #/s |
| Trigram rate | Count of all trigram types per unit time | #/s |
| Syllabic cluster count | Number of syllabic clusters | # |
| Speech rate | Syllabic cluster count per unit time | #/s |
| New features proposed in this work | | |
| Strength of unigram | Average strength of each unigram type | % |
| Strength of bigram | Average strength of each bigram type | % |
| Strength of trigram | Average strength of each trigram type | % |
| Strength change | Average absolute strength difference of two consecutive LMs | % |
| Average bigram strength | Average strength of all bigram types | % |
| Average trigram strength | Average strength of all trigram types | % |
| Unigram/unigram ratio | Ratio of unigram counts of each type | - |
| Bigram/unigram ratio | Ratio of bigram count to unigram count of each type | - |
| Trigram/unigram ratio | Ratio of trigram count to unigram count of each type | - |
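As an illustration of the proposed ratio-based features, the following minimal sketch (with a hypothetical landmark sequence and illustrative variable names, not the authors' code) builds bigram/unigram ratio features and drops ratios whose denominator is zero, as described above.

```python
from collections import Counter

def ngram_counts(landmarks, n):
    return Counter(tuple(landmarks[i:i + n]) for i in range(len(landmarks) - n + 1))

def ratio_features(numer_counts, unigram_counts):
    """Build n-gram/unigram ratio features, skipping ratios with a zero denominator."""
    feats = {}
    for ngram, count in numer_counts.items():
        for (uni,), uni_count in unigram_counts.items():
            if uni_count > 0:  # illegitimate ratios (zero denominator) are dropped
                feats["".join(ngram) + "/" + uni] = count / uni_count
    return feats

# Hypothetical landmark sequence for one utterance.
lm_sequence = ["+g", "+s", "-s", "+s", "-s", "-g", "+b", "-b"]
unigrams = ngram_counts(lm_sequence, 1)
bigrams = ngram_counts(lm_sequence, 2)

ratios = ratio_features(bigrams, unigrams)  # e.g., ratios["-s+s/+g"] == 1.0
```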
2.5 Feature selection
It has been shown in many studies that the performance of classification methods can be significantly improved if only the relevant features are included as predictors. Feature selection can also reduce the risk of overfitting, which is especially important when the number of samples is small relative to the number of features (as is the case in this study). Finally, feature selection reduces model complexity, making results easier to interpret. As a result, feature selection has been one of the most important practical concerns in data-driven approaches. In the past few decades, many different feature selection approaches have been reported for various modeling and classification applications; for more detailed discussions, readers are referred to recent review articles.
In this work, a two-step feature selection procedure is proposed. In the first step, the redundant features (i.e., the features that are highly correlated with an existing feature) are removed. In the study, a Pearson correlation coefficient of 0.99 is used as the criterion to determine whether a feature is redundant with an existing feature or not. After this step, the number of features is reduced to 189 from the original 303 features, indicating that there is significant redundancy among the original features.
In the second step, recursive feature elimination with cross-validation (RFECV) from scikit-learn is utilized with its default 5-fold cross-validation. Fig 2 shows the cross-validation score vs. the number of features when a linear discriminant analysis (LDA) model is used as the classifier; it indicates that only 10 features are needed to obtain the optimal cross-validation score. The 10 selected features are listed in Table 3. As can be seen from Table 3, nine of the ten features are new features proposed in this work that have not been utilized before. Among these nine new features, seven are ratio based and two are strength based (a minimal sketch of the two-step selection procedure is given after Table 3).
Fig 2. Recursive feature elimination with cross-validation (RFECV).
Table 3. The ten features selected based on RFECV.
| Feature category | Feature specifics |
|---|---|
| Ratios of bigram count to unigram count | ’-g-b/+g’ |
| Ratios of trigram count to unigram count | ’-s+s-s/+g’ |
| Ratios of trigram count to unigram count | ’+b+g+s/+g’ |
| Ratios of bigram count to unigram count | ’-b+g/+g’ |
| Trigram counts | ’+b-b+b’ |
| Ratios of trigram count to unigram count | ’-s-g+b/+g’ |
| Ratios of trigram count to unigram count | ’-g+b-b/+g’ |
| Strength of trigrams | ’+g-g-b’ |
| Strength of unigrams | ’-f’ |
| Ratios of trigram count to unigram count | ’+b+g-v/+g’ |
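The listing below is a minimal sketch of the two-step selection procedure described in Sec. 2.5, assuming the engineered features are stored in a pandas DataFrame `X` with binary labels `y`; the function names are illustrative, not the authors' code.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFECV

def drop_redundant(X: pd.DataFrame, threshold: float = 0.99) -> pd.DataFrame:
    """Step 1: drop features whose absolute Pearson correlation with an
    already-retained feature is at or above the threshold."""
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return X.drop(columns=to_drop)

def select_features(X: pd.DataFrame, y) -> pd.Index:
    """Step 2: recursive feature elimination with 5-fold cross-validation,
    scoring candidate subsets with an LDA classifier."""
    X_reduced = drop_redundant(X)          # 303 -> 189 features in this study
    selector = RFECV(LinearDiscriminantAnalysis(), step=1, cv=5)
    selector.fit(X_reduced, y)
    return X_reduced.columns[selector.support_]   # 10 features in this study
```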
2.6 Sample imbalance
Among the 51 samples, 39 belong to normal speakers, while the remaining 12 belong to disordered speakers, an approximately 3:1 class imbalance. However, most machine learning classification algorithms are developed with the implicit assumption of approximately equal numbers of samples in each class. Therefore, data with imbalanced or skewed classes may result in poor classification performance for the minority class. For example, if the model is tuned using accuracy, the resulting model may classify the majority class mostly correctly at the cost of poor classification of the minority class. However, correct classification of minority class samples is often more critical, as they usually represent the disease group; misclassification of these samples leads to low sensitivity. Several ways of dealing with class imbalance have been proposed in the literature, such as under-sampling, over-sampling, synthetic sample generation, cost-sensitive methods, and penalties or weights based on the class ratio [29]. Under-sampling reduces the number of samples in the majority class to improve the imbalance ratio, while over-sampling increases the number of samples in the minority class. Over-sampling is used more often than under-sampling in order to maximally utilize the available samples. Random over-sampling increases the minority class by duplicating randomly selected minority class samples; however, this most straightforward approach does not add new information during training and is not considered robust. A more robust over-sampling method is the synthetic minority over-sampling technique (SMOTE), in which new samples are synthesized from the existing samples [30]. SMOTE is utilized in this work to address the class imbalance issue; we briefly review the technique and its implementation details in the following subsection.
2.7 Synthetic minority over-sampling technique (SMOTE)
Based on the feature space of the minority samples, SMOTE first selects a minority class sample at random (denoted a) and then finds its k nearest minority class neighbors. One of the k nearest neighbors is selected at random (denoted b), and a line segment connecting a and b is formed in the feature space. The synthetic sample is generated as a linear interpolation between the two chosen samples, a and b, as follows.
$$x_{\mathrm{new}} = x_a + \lambda\,(x_b - x_a) \qquad (1)$$
where xi denotes the feature vector (i.e., a point in the feature space) of sample i, and λ is a random number in the range [0, 1]. For features that take only integer values (e.g., n-gram counts), xnew is rounded to the nearest integer. More information on SMOTE can be found in [30]. For implementation, the Python library imbalanced-learn was used for SMOTE oversampling [31]. Due to the limited data, we first randomly isolate one sample from each class for testing. Once these two samples, one from a normal speaker and one from a disordered speaker, are removed, we apply SMOTE oversampling to balance the remaining dataset. The primary purpose of separating the test samples before oversampling is to avoid biasing the model through the test samples’ influence on the synthetic samples. After removing one sample from each class for testing, the training set contains 38 samples from normal speakers and 11 samples from disordered speakers. SMOTE oversampling is applied to the training set, generating 27 samples in the minority class (disordered speakers) to balance the dataset.
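A minimal sketch of this step with imbalanced-learn is shown below; `X_train` and `y_train` are placeholder NumPy arrays for the 49 training samples, and the random seed and count-column indices are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# y_train: 0 = normal speaker (38 samples), 1 = disordered speaker (11 samples).
smote = SMOTE(k_neighbors=5, random_state=0)
X_bal, y_bal = smote.fit_resample(X_train, y_train)
print(np.bincount(y_bal))   # expected: [38 38], i.e., 27 synthetic minority samples added

# SMOTE interpolation produces fractional values, so integer-valued features
# (e.g., n-gram counts) are rounded to the nearest integer as described above.
count_cols = [0, 1, 2]      # hypothetical indices of the count-type columns
X_bal[:, count_cols] = np.round(X_bal[:, count_cols])
```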
2.8 Monte-Carlo cross validation and testing (MCVT)
Once the training set is balanced, we train different classification models, perform feature selection, and tune their hyperparameters using 10-fold cross-validation on the training set. We then apply the models to the left-out test samples and record the sensitivity and specificity of each model. This whole procedure is referred to as one Monte-Carlo cross-validation and testing (MCVT) run [32]. We report the mean and standard deviation of sensitivity and specificity over 50 such MCVT runs, which is a robust way of comparing different modeling techniques and assessing their performance. MCVT avoids overfitting by first randomly selecting and isolating the two test samples, then utilizing the SMOTE technique to balance the rest of the dataset, which is further split into training and validation sets for model training, feature selection, and hyperparameter tuning. For hyperparameter tuning, we use 10-fold stratified cross-validation (CV) to select the optimal hyperparameters. The schematic of the proposed MCVT procedure is shown in Fig 3.
Fig 3. Schematic of the MCVT for comparing different modeling techniques and assessing their performances in terms of accuracy and robustness.
Sensitivity and specificity are the two most critical and commonly used metrics for binary classification problems in healthcare. Sensitivity is the true positive rate, i.e., the classifier’s ability to correctly detect diseased patients, and specificity is the true negative rate, i.e., the classifier’s ability to correctly detect normal controls (those without the disease). We also use accuracy as a single measure when evaluating the overall performance of a classifier. The mathematical definitions of these terms are given below.
$$\text{Sensitivity} = \frac{n_{TP}}{n_{TP} + n_{FN}} \qquad (2)$$

$$\text{Specificity} = \frac{n_{TN}}{n_{TN} + n_{FP}} \qquad (3)$$

$$\text{Accuracy} = \frac{n_{TP} + n_{TN}}{n_{TP} + n_{TN} + n_{FP} + n_{FN}} \qquad (4)$$
where nTP is the number of true positives, nFN the number of false negatives, nTN the number of true negatives, and nFP the number of false positives. Sensitivity, specificity, and accuracy all range from 0 to 1 (or 0 to 100%).
As shown in Fig 3, the mean sensitivity and specificity over the MCVT runs can be used to assess the accuracy of a classifier, while the standard deviations of the sensitivity and specificity over the runs can be used to quantify its robustness (i.e., how consistently a classifier performs when trained with randomly selected training samples). It is worth noting that because only one sample from each class is left out for testing in this work, the standard deviations of the sensitivity and specificity would be biased by the extremely small test-set size. Therefore, this measure of robustness is not utilized in this work.
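A minimal sketch of one MCVT run and the 50-run loop is given below, assuming the selected features and labels are NumPy arrays `X` and `y` (1 = disordered); feature selection and hyperparameter tuning are omitted for brevity, and the classifier choice is illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def one_mcvt_run(X, y, seed):
    """Hold out one sample per class, SMOTE-balance the rest, train, and
    return (true labels, predicted labels) for the two test samples."""
    rng = np.random.default_rng(seed)
    test_idx = [rng.choice(np.flatnonzero(y == c)) for c in (0, 1)]
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X[train_idx], y[train_idx])
    clf = LinearDiscriminantAnalysis().fit(X_bal, y_bal)
    return y[test_idx], clf.predict(X[test_idx])

y_true, y_pred = [], []
for seed in range(50):                      # 50 MCVT runs
    t, p = one_mcvt_run(X, y, seed)
    y_true.extend(t); y_pred.extend(p)
y_true, y_pred = np.array(y_true), np.array(y_pred)
sensitivity = np.mean(y_pred[y_true == 1] == 1)   # Eq. (2)
specificity = np.mean(y_pred[y_true == 0] == 0)   # Eq. (3)
```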
2.9 The trade-off between sensitivity and specificity
For binary classification, there is often a trade-off between sensitivity and specificity. This trade-off can be visualized in a receiver operating characteristic (ROC) curve, which plots sensitivity vs. (1 − specificity), as illustrated in Fig 4. The area under the curve (AUC) is a summative measure of the classification capability of a classifier. A perfect classifier has an AUC of 1, while a classifier making random guesses has an AUC of 0.5; in practice, a typical classifier has an AUC between 0.5 and 1. When selecting an operating point on the ROC curve, the costs associated with each type of misclassification must often be considered in balancing sensitivity against specificity, and this balance is usually adjusted through class priors or class weights. For example, to use the proposed method as a screening tool, we may want to trade (or sacrifice) some specificity for higher sensitivity, because a truly disordered speaker misclassified as a normal speaker may miss the opportunity to be further examined by a speech specialist.
Fig 4. The trade-off between sensitivity and specificity illustrated by a receiver operating characteristic (ROC) curve.
2.10 Classification techniques
In this work, four different classification algorithms are investigated, namely linear discriminant analysis (LDA), support vector machine (SVM), extreme gradient boosting (XGBoost), and random forest (RF). For LDA, we consider the effect of shrinkage, a form of regularization to avoid overfitting, along with class priors, to address the unequal costs of misclassification. For SVM, we examine the effects of different kernels, along with class weights. Throughout the modeling procedure, grid search and random search are used for hyperparameter tuning via the scikit-learn library in Python [33]. However, class weights and priors are not tuned automatically; instead, several discrete values are considered. We consider several feature categories based on the feature engineering and different levels of feature selection, and compare the raw features with the engineered feature categories.
2.10.1 Linear discriminant analysis
Linear discriminant analysis (LDA) is one of the most commonly used linear classification techniques in machine learning. In this work, the LDA function from the Python scikit-learn library [34] is used, which generates a linear decision boundary based on Bayes’ rule by modeling the posterior probability P(y = k | x_i) of each class k for a training sample with d features (i.e., x_i ∈ R^d):
$$P(y = k \mid x_i) = \frac{P(x_i \mid y = k)\, P(y = k)}{\sum_{l} P(x_i \mid y = l)\, P(y = l)} \qquad (5)$$
where y is the class label; the class k that maximizes the posterior probability is selected.
Shrinkage is a form of regularization used in LDA to improve the estimation of the covariance matrices when the number of training samples is small compared to the number of features. The effect of shrinkage is studied in this work. In addition, the prior probability P(y = k) is studied for its effectiveness in addressing the unequal costs of misclassification. More information can be found in [34].
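A minimal sketch of how such an LDA classifier could be configured in scikit-learn is shown below; the shrinkage setting and priors are illustrative, not the tuned values from this study, and `X_bal`, `y_bal`, `X_test` are placeholder arrays.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Shrinkage requires the 'lsqr' or 'eigen' solver; 'auto' uses the Ledoit-Wolf
# estimate of the shrinkage intensity for the covariance matrix.
lda = LinearDiscriminantAnalysis(
    solver="lsqr",
    shrinkage="auto",
    priors=[0.3, 0.7],       # (normal, disordered) class priors, illustrative values
)
lda.fit(X_bal, y_bal)        # SMOTE-balanced training set
y_pred = lda.predict(X_test) # class with the highest posterior, Eq. (5)
```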
2.10.2 Support vector machine
Support vector machine (SVM) is a classification approach developed in the 1990s. SVMs have shown superior performance in a variety of settings and are often considered one of the best “out of the box” classifiers [35]. For ease of interpretation and to reduce the risk of overfitting, in this work we focus on the two-class linear SVM. Consider n samples, each with d features (i.e., x_i ∈ R^d, i = 1,…,n), and their labels y_i ∈ {+1, −1}. Linear SVM identifies a hyperplane, a linear function in the feature space f(x) = ⟨w, x⟩ + b, where w is the coefficient vector, b is a real constant, and ⟨∙,∙⟩ denotes the dot product in the feature space. The hyperplane is placed such that the maximum distance between the two classes (i.e., the class margin) is achieved. This is equivalent to the following minimization problem [36]:
$$\min_{w,\, b} \ \frac{1}{2}\lVert w \rVert^2 \qquad (6)$$

$$\text{subject to } y_i \left( \langle w, x_i \rangle + b \right) \ge 1, \quad i = 1, \ldots, n \qquad (7)$$
For a nonseparable case, a soft margin is introduced so that the minimization problem becomes
$$\min_{w,\, b,\, \xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \qquad (8)$$

$$\text{subject to } y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \qquad (9)$$
where C is a positive constant that controls the penalty for margin violations and ξi are the slack variables.
More information on SVM and its training can be found elsewhere [35,36]. In this work, scikit-learn [33] with the LIBSVM [37] library is used to implement linear SVM. For comparison, we also implement SVMs with nonlinear kernels, including polynomial, radial basis function (RBF), and sigmoid kernels.
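The sketch below shows one way the SVM variants compared in this work could be set up in scikit-learn (which wraps LIBSVM); the class weights are illustrative, and `X_bal`, `y_bal` are placeholder arrays for the balanced training set.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One pipeline per kernel; class_weight shifts the sensitivity/specificity
# trade-off (here the disordered class 1 is penalized twice as heavily).
svms = {
    kernel: make_pipeline(StandardScaler(),
                          SVC(kernel=kernel, class_weight={0: 1, 1: 2}))
    for kernel in ("linear", "poly", "rbf", "sigmoid")
}
for name, model in svms.items():
    model.fit(X_bal, y_bal)
```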
2.10.3 Random forest
Random forest (RF) is an ensemble of decision tree classifiers. It is an extension of bootstrap aggregation, or bagging, of decision trees. In bagging, each classifier’s training set is generated by random sampling, with or without replacement, from all samples available for training. The individual predictions of the classifiers are aggregated through a hard or soft voting scheme to form the final prediction. Unlike plain bagging, RF also selects a random subset of input features at each split point during tree construction. Scikit-learn is used for the implementation of RF. Hyperparameters, including the number of trees, maximum tree depth, and number of features considered at each split, are tuned using a random search hyperparameter optimization procedure. More information on RF can be found in [35,38].
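A minimal sketch of this tuning setup is given below; the search ranges, number of iterations, and scoring metric are illustrative assumptions rather than the exact settings used in this study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [50, 100, 200, 400],     # number of trees
    "max_depth": [2, 4, 8, None],            # maximum tree depth
    "max_features": ["sqrt", "log2", 0.5],   # features considered at each split
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=10,                 # 10-fold CV on the balanced training set
    scoring="accuracy",
    random_state=0,
)
search.fit(X_bal, y_bal)   # placeholder names for the SMOTE-balanced data
rf_best = search.best_estimator_
```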
2.10.4 Extreme gradient boosting (XGBoost)
Another decision-tree-based ensemble method used in this work is boosting. In contrast to bagging, boosting combines homogeneous weak learners sequentially in an adaptive way, with each model depending on the previous ones. XGBoost is one of the most popular boosting approaches; it has been used widely and has achieved state-of-the-art results on many machine learning challenges [39]. XGBoost is an optimized distributed gradient boosting library implemented under the gradient boosting framework. More information on XGBoost can be found in [39].
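A minimal sketch of an XGBoost classifier for this binary task is shown below; the hyperparameter values are illustrative, not those selected by the grid or random search in this work, and the array names are placeholders.

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,            # number of boosting rounds (sequential trees)
    max_depth=3,                 # depth of each weak learner
    learning_rate=0.1,           # shrinkage applied to each new tree
    objective="binary:logistic",
)
xgb.fit(X_bal, y_bal)                    # SMOTE-balanced training set
probs = xgb.predict_proba(X_test)[:, 1]  # score for the disordered class
```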
3 Results and discussion
In this work, we conduct the investigation from two perspectives: (1) comparing classification performance when different feature sets are used, and (2) comparing classification performance when different classification techniques are used. When comparing different features, the following three feature sets are studied: (a) the original 21 raw features obtained directly from the SpeechMark toolbox, which include the counts and strengths of the ten LMs listed in Table 1 (considering both onset and offset) plus one syllabic cluster count per sample; (b) the 189 features obtained through rational feature engineering of the feature types listed in Table 2, after redundant features (Pearson correlation coefficient ≥ 0.99) are removed; and (c) the ten features selected from the 189 features via RFECV, as discussed in Sec. 2.5 and listed in Table 3. When comparing different classification techniques, each technique is applied to all three feature sets.
It is worth noting that all results presented in this work are based on unseen test data (i.e., data not involved in any training steps such as feature selection or hyperparameter tuning). Due to the small number of samples, we would not have enough data for model training if 20–30% of the dataset were left out for testing, as is usually recommended. As a result, we leave one sample out from each class for the test set (i.e., 2 samples in total). To avoid bias or cherry-picking due to the small number of test samples, we perform 50 Monte Carlo cross-validation and testing (MCVT) runs and use the average over the 50 runs for performance evaluation. As we have demonstrated previously, this method provides robust and fair evaluations even with a small number of samples [32].
As shown in Table 4 and Fig 5, when the 21 raw features are used, SVM with RBF kernel provides the best overall classification performance with 75.0% accuracy (i.e., 75.0% of the samples are classified correctly). SVM with linear kernel provides the second-best result with 71.0% accuracy. The overall performances of all methods, linear or nonlinear, are relatively poor, indicating that the raw features are not very informative in classifying the two classes.
Table 4. Comparison of classification performance based on raw features.
| Method | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|
| LDA | 64 | 54 | 59 |
| SVM (Linear) | 68 | 74 | 71 |
| SVM (Poly) | 78 | 32 | 55 |
| SVM (RBF) | 70 | 80 | 75 |
| SVM (Sigmoid) | 80 | 54 | 67 |
| XGBoost | 50 | 86 | 68 |
| RF | 28 | 76 | 52 |
Fig 5. Comparison of classification performance based on raw features.
Next, we apply the different classification methods to the ten features obtained through rational feature engineering and selection. The results are listed in Table 5 and shown in Fig 6. Comparing Tables 4 and 5 (or Figs 5 and 6), we can see that the performance of every method has improved in terms of sensitivity, specificity, and accuracy. While some improvements are moderate, such as those of SVM (RBF) and XGBoost (less than 10% improvement in accuracy), others are significant (as high as a 34% improvement in accuracy). Recall that nine of the ten selected features are newly proposed in this work. The notably improved performance with these features across all classification methods demonstrates that the proposed features are more informative than the raw features. Since seven of the selected features are ratio based, the improved performance most likely supports our hypothesis that ratio-based features are better at addressing individual variations of samples from the same class. The direct comparison of accuracy using the two feature sets is shown in Fig 7. In particular, the LDA classifier achieves 94.0%, 92.0%, and 93.0% in sensitivity, specificity, and overall accuracy, respectively. Several other methods, including SVM with linear, polynomial, and sigmoid kernels, also achieve nearly 90.0% in sensitivity, specificity, and overall accuracy. In addition, using the raw features led to skewed or imbalanced sensitivity and specificity for several methods, as shown in Fig 5. For example, SVM with polynomial and sigmoid kernels has high sensitivity but poor specificity, while XGBoost and RF have high specificity but poor sensitivity. In comparison, the sensitivity and specificity based on the selected engineered features are much more balanced.
Table 5. Comparison of classification performance based on rationally engineered and selected features.
| Method | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|
| LDA | 94 | 92 | 93 |
| SVM (Linear) | 86 | 92 | 89 |
| SVM (Poly) | 84 | 94 | 89 |
| SVM (RBF) | 72 | 94 | 83 |
| SVM (Sigmoid) | 88 | 88 | 88 |
| XGBoost | 60 | 88 | 74 |
| RF | 76 | 94 | 85 |
Fig 6. Comparison of classification performance when selected features are used.
Fig 7. The impact of class priors on sensitivity, specificity and accuracy of the LDA classifier.
As discussed previously, the class prior probabilities or class weights can affect sensitivity and specificity, and they can be adjusted to account for the unequal costs of misclassification (i.e., the cost of a false positive vs. that of a false negative). In this work, since we aim to develop a screening method, high sensitivity is more desirable: children with speech disorder who are misclassified as normal speakers (false negatives) may miss the opportunity to be examined by a speech specialist, whereas false positives cause little harm beyond the cost of the follow-up examination. Since LDA performs the best among all methods and is more robust than some of the nonlinear methods, we focus on examining the impact of class priors on LDA. Three different class priors are studied, namely (0.5, 0.5), (0.3, 0.7), and (0.1, 0.9), where (0.5, 0.5) indicates equal priors for the normal and disordered classes, and (0.1, 0.9) assigns the disordered class a prior probability nine times that of the normal control group. The expectation is that the priors of (0.1, 0.9) would lead to higher sensitivity than (0.5, 0.5) or (0.3, 0.7). The results, shown in Table 6 and Fig 7, indicate that the sensitivity and specificity are not significantly affected by the class priors. Specifically, there is no change in sensitivity or specificity when the class priors are changed from (0.5, 0.5) to (0.3, 0.7). There is a slight increase in sensitivity (from 94.0% to 96.0%) when class priors of (0.1, 0.9) are used, while the specificity is unchanged.
Table 6. LDA classification performance when different class priors are used.
| Prior (normal, disordered) | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|
| (0.5, 0.5) | 94.0 | 92.0 | 93.0 |
| (0.3, 0.7) | 94.0 | 92.0 | 93.0 |
| (0.1, 0.9) | 96.0 | 92.0 | 94.0 |
A more thorough examination of the trade-off between sensitivity and specificity of the LDA classifier is shown in the ROC curve (Fig 8), which is obtained by varying the class priors over a much wider range than the three cases presented above. As discussed previously, the best possible classifier would yield a point in the upper left corner of the ROC space, representing 100% sensitivity and 100% specificity. As shown in Fig 8, with proper tuning, the ROC curve of LDA approaches that point with 96% sensitivity and 92% specificity. The ROC curve can serve as a visual tool for selecting the LDA tuning parameters, i.e., the class priors, based on a cost/benefit analysis of the speech disorder screening decision.
Fig 8. ROC curve of the LDA classifier.
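A sketch of how such a curve could be generated is shown below; it assumes pooled test labels and LDA posterior scores collected over the MCVT runs (placeholder names) and sweeps the decision threshold on the posterior, which has the same effect as varying the class priors.

```python
from sklearn.metrics import auc, roc_curve

# y_true: pooled test labels (1 = disordered); scores: pooled LDA posterior
# probabilities for the disordered class, e.g., lda.predict_proba(X_test)[:, 1].
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Sensitivity = tpr and specificity = 1 - fpr; each threshold corresponds to
# one operating point (one choice of class priors) on the curve.
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")
```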
4 Conclusion
In this work, we propose an automated computer-assisted screening method for children with speech disorders. The main contribution of this work is a set of novel knowledge-based features and a demonstration of their effectiveness in detecting childhood speech disorder. In particular, this work proposes specific and average strengths of n-grams and ratio-based features. The ratio-based features are found to be particularly informative in characterizing audio recordings for speech disorder detection. Similar to the BMI metric used in obesity studies, the ratio-based features proposed in this work are hypothesized to address the usually wide individual variations among samples from the same class better than their individual components. This is validated by the significant improvements in classification obtained with these new features across different classification methods, compared with the results based on the raw features.

As in many other medical studies, the sample size of normal controls is significantly greater than that of patients, creating the so-called sample imbalance problem that can negatively affect many conventional classification techniques. In this work, we found SMOTE to be an effective and easy-to-implement technique to address this issue. However, caution must be taken to ensure that the synthesized samples mimic the true minority samples in terms of feature properties (e.g., n-gram counts can only take non-negative integer values). In addition, to avoid overfitting, the synthetic samples should not be used as test samples.

To improve classification performance, reduce the risk of overfitting, and reduce model complexity, we propose a two-step feature selection procedure. In the first step, highly correlated, redundant features are removed based on their Pearson correlation coefficients. In the second step, RFECV is utilized to further reduce the number of features. Through this two-step procedure, it is found that only ten features are needed to obtain the optimal cross-validation accuracy. Based on the raw features and the selected ten features, a systematic study and comparison of different linear and nonlinear machine learning classification techniques is conducted. It is found that with the raw features, all classification methods, linear or nonlinear, fail to achieve high classification performance. In comparison, with the ten selected features, nine of which are proposed in this work, the performance of all classification methods is significantly improved, indicating that the proposed features are more effective for characterizing speech disorders using speech LMs.
It is worth noting that small sample sizes have always been a limitation in biomedical studies due to labor, time, and other constraints. However, with careful separation of model training (including feature selection and hyperparameter tuning) and testing (using samples completely left out of the model training process), meaningful conclusions can be drawn from analyses based on a small number of samples. In this regard, MCVT is a robust technique for comparing different modeling techniques and assessing their performance with a small number of test samples. Feature selection is also an effective way to avoid overfitting and reduce test variance, especially when there are more features than observations, as in this study. Finally, SMOTE can help alleviate the sample imbalance problem by generating synthetic samples. A word of caution is that the above procedures ought to be limited to the training samples only, to avoid overfitting through feature selection on test samples and bias introduced by the artificial samples.
Data Availability
All data are in the manuscript and/or supporting information files.
Funding Statement
Auburn University Office of the Vice President for Research & Economic Development Innovation Award (MSA). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
1. Prelock PA, Hutchins T, Glascoe FP. Speech-language impairment: How to identify the most common and least diagnosed disability of childhood. The Medscape Journal of Medicine. 2008;10(6):136.
2. American Academy of Pediatrics. Bright Futures. Elk Grove Village, IL: American Academy of Pediatrics; 1990.
3. American Academy of Pediatrics. Erratum: Council on Children with Disabilities, Section on Developmental Behavioral Pediatrics, Bright Futures Steering Committee, Medical Home Initiatives for Children with Special Needs Project Advisory Committee. Pediatrics. 2006;118:1808–9.
4. Rosenbaum S, Simon P. Speech and Language Disorders in Children: Implications for the Social Security Administration’s Supplemental Security Income Program. 2016. 1–287 p. doi: 10.17226/21872
5. Nelson HD, Nygren P, Walker M, Panoscha R. Erratum: Screening for speech and language delay in preschool children: Systematic evidence review for the US Preventive Services Task Force (Pediatrics (2006) 117, (e298–e319)). Pediatrics. 2006;117(6):298–318.
6. Ishikawa K, MacAuslan J, Boyce S. Toward clinical application of landmark-based speech analysis: Landmark expression in normal adult speech. J Acoust Soc Am. 2017;142(5):EL441–EL447. doi: 10.1121/1.5009687
7. Tyler AA, Tolbert LC. Speech-language assessment in the clinical setting. 2002.
8. Stoel-Gammon C. Transcribing the speech of young children. Top Lang Disord. 2001;21(4):12–21.
9. Ball MJ, Rahilly J. Transcribing disordered speech: The segmental and prosodic layers. Clin Linguist Phon. 2002;16(5):329–44. doi: 10.1080/02699200210135866
10. Berisha V, Utianski R, Liss J. Towards a clinical tool for automatic intelligibility assessment. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2013. p. 2825–8.
11. Stevens KN. Toward a model for lexical access based on acoustic landmarks and distinctive features. J Acoust Soc Am. 2002 Apr;111(4):1872–91. doi: 10.1121/1.1458026
12. Liu SA. Landmark detection for distinctive feature-based speech recognition. J Acoust Soc Am. 1996 Nov;96(5):3227.
13. Howitt AW. Automatic syllable detection for vowel landmarks. 2000.
14. Hansen JHL, Patil S. Speech under stress: Analysis, modeling and recognition. In: Speaker Classification I. Springer; 2007. p. 108–37.
15. Huang Z, Epps J, Joachim D. Investigation of speech landmark patterns for depression detection. IEEE Transactions on Affective Computing. 2019.
16. Dai K, Fell HJ, MacAuslan J. Recognizing emotion in speech using neural networks. Telehealth and Assistive Technologies. 2008;31:38–43.
17. New TL, Li H, Dong M. Analysis and detection of speech under sleep deprivation. In: Ninth International Conference on Spoken Language Processing. 2006.
18. Speights Atkins M, Boyce SE, Macauslan J, Silbert N. Computer-assisted syllable complexity analysis of continuous speech as a measure of child speech disorders. In: Proceedings of the International Congress of Phonetic Sciences. 2019. p. 3–7.
19. Boyce S, Fell H, Wilde L, Macauslan J. Automated tools for identifying syllabic landmark clusters that reflect changes in articulation. In: Models and Analysis of Vocal Emissions for Biomedical Applications, 7th International Workshop, MAVEBA 2011. 2011. p. 63–6.
20. Chenausky K, MacAuslan J, Goldhor R. Acoustic analysis of PD speech. Parkinson’s Disease. 2011;2011. doi: 10.4061/2011/435232
21. Park C. Consonant landmark detection for speech recognition. Massachusetts Institute of Technology; 2008. doi: 10.1121/1.2823754
22. Huang Z, Epps J, Joachim D. Speech landmark bigrams for depression detection from naturalistic smartphone speech. In: ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019. p. 5856–60.
23. Huici HD, Kairuz HA, Martens H, van Nuffelen G, de Bodt M. Speech rate estimation in disordered speech based on spectral landmark detection. Biomed Signal Process Control. 2016;27:1–6.
24. Trevino AC, Quatieri TF, Malyska N. Phonologically-based biomarkers for major depressive disorder. EURASIP Journal on Advances in Signal Processing. 2011;2011(1):1–18.
25. Speights Atkins M, Bailey DJ, Boyce SE. Speech exemplar and evaluation database (SEED) for clinical training in articulatory phonetics and speech science. Clin Linguist Phon. 2020 Sep;34(9):878–86. doi: 10.1080/02699206.2020.1743761
26. Wiig EH, Secord WA, Semel E. CELF-Preschool-2: Clinical Evaluation of Language Fundamentals, Preschool. Harcourt Assessment; 2004.
27. Dodd B, Zhu H, Crosbie S, Holm A, Ozanne A. Diagnostic Evaluation of Articulation and Phonology (DEAP). Psychology Corporation; 2002.
28. Anderson C, Cohen W. Measuring word complexity in speech screening: Single-word sampling to identify phonological delay/disorder in preschool children. International Journal of Language and Communication Disorders. 2012 Sep;47(5):534–41. doi: 10.1111/j.1460-6984.2012.00163.x
29. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent Data Analysis. 2002;6(5):429–49.
30. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–57.
31. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research. 2017.
32. Shah D, Wang J, He QP. A feature-based soft sensor for spectroscopic data analysis. Journal of Process Control. 2019;78:98–107.
33. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011.
34. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning. Springer Series in Statistics, New York; 2001.
35. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. Vol. 112. Springer; 2013.
36. Lau KW, Wu QH. Online training of support vector classifier. Pattern Recognition. 2003;36(8):1913–20.
37. Chang CC, Lin CJ. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011.
38. Breiman L. Random forests. Machine Learning. 2001.
39. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.








