2025 Feb 21;77(5):480–494. doi: 10.1159/000544852

Screening Voice Disorders: Acoustic Voice Quality Index, Cepstral Peak Prominence, and Machine Learning

Ahmed M Yousef a,b,c, Adrián Castillo-Allendes c,d, Mark L Berardi c, Juliana Codino e, Adam D Rubin e, Eric J Hunter c
PMCID: PMC12353333  PMID: 39987907

Abstract

Introduction

The Acoustic Voice Quality Index (AVQI) and Smoothed Cepstral Peak Prominence (CPPs) have been reported to effectively support the assessment of voice quality in persons seeking voice care across many languages. This study aimed to evaluate the diagnostic accuracy of these two measures in detecting voice disorders in American English speakers, comparing their performance to machine learning (ML) models.

Methods

This retrospective study included a cohort of 187 participants: 138 patients with clinically diagnosed voice disorders and 49 vocally healthy individuals. Each participant completed two voicing tasks: sustaining an [a:] vowel and producing a running speech sample, which were then concatenated. These samples were analyzed using VOXplot software for AVQI-3 (version 03.01) and CPPs. Additionally, four ML models (random forest, k-nearest neighbors, support vector machine, and decision tree) were trained for comparison. The diagnostic accuracy of the two measures and models was assessed using various evaluation metrics, including the receiver operating characteristic curve and Youden Index.

Results

A cutoff score of 1.54 for the AVQI-3 (with 55% sensitivity and 80% specificity) and 14.35 dB for CPPs (with 65% sensitivity and 78% specificity) were identified for detecting voice disorders. Compared to an average ML sensitivity of 89% and specificity of 55%, CPPs offered a better balance between sensitivity and specificity, outperforming AVQI-3 and nearly matching the average ML performance.

Conclusions

ML shows great potential for supporting voice disorder diagnostics, especially as models become more generalizable and easier to interpret. However, current tools like AVQI-3 and CPPs remain more practical and accessible for clinical use in evaluating voice quality than commonly implemented models. CPPs, in particular, offers distinct advantages for identifying voice disorders, making it a recommended and feasible choice for clinics with limited resources.

Keywords: Voice disorders, Machine learning, Speech acoustics, Acoustic Voice Quality Index, Cepstral Peak Prominence

Introduction

Voice disorders are common, with studies showing that approximately 10% of US adults experienced a vocal health problem in 2012 [1, 2]. Several clinical tools exist to support the assessment, diagnosis, and tracking of voice disorders and vocal health changes, including self-reported questionnaires [3], perceptual evaluations of voice [4, 5], aerodynamic measurements [6, 7], laryngeal imaging [8–11], and acoustic voice analysis [12–16]. While laryngeal imaging is the gold standard, acoustic analysis of voice offers a highly tolerable, non-invasive, objective, and cost-effective assessment option compared to the other tools, making it highly accessible for voice clinicians, voice researchers, and patients. Acoustic voice analysis involves processing the vocal audio signal using signal processing techniques to extract key acoustic features of voice and speech. These acoustic features can serve as digital markers of vocal health. Commonly used and previously reported features offer a detailed view of the characteristics of voice production [17], including fundamental frequency (related to the perception of pitch), intensity (related to the perception of loudness), jitter (frequency fluctuations in vocal fold vibrations), shimmer (amplitude variations in vocal intensity), and noise-to-harmonics ratio (the proportion of noise-to-harmonic components in the voice signal). However, many of these measures rely heavily on estimating the fundamental frequency from the signal, which can be difficult in severely dysphonic voices [18, 19].

Among acoustic voice parameters, Cepstral Peak Prominence (CPP) has received significant attention as a potential measure for evaluating overall voice quality without requiring a reliable estimate of fundamental frequency [20, 21]. In other words, CPP assesses how regular the voice signal is without the need to calculate the fundamental frequency [16]. CPP, and its variants such as Smoothed CPP (CPPs) and Cepstral Spectral Index of Dysphonia (CSID), can be calculated from both sustained vowels and connected speech. Studies have consistently shown a strong correlation between CPP, CPPs, and CSID values and perceptual judgments of voice quality across various languages, making these measures effective for detecting dysphonia [22–27]. Lower CPPs values, measured in decibels, are typically associated with dysphonic voices (e.g., voices that are less regular, noisier, or less periodic). Additionally, CPP and CPPs have been implemented in mobile devices for clinical screening and for monitoring treatment adherence and progress [15, 28].

The goal of CPP and the other acoustic parameters is to gain a better understanding of vocal health and vocal production. Ideally, each acoustic metric would capture a unique feature or element of the sound signal or tie it to particular voice problems [29]. However, our understanding of these connections remains limited. To further enhance the value of these measurements, multiparametric approaches have emerged, combining several acoustic parameters into a single index for a more comprehensive assessment. Examples include the Dysphonia Severity Index (DSI) [30], Acoustic Voice Quality Index (AVQI) [31], and Acoustic Breathiness Index (ABI) [32]. Linearly combining multiple parameters enhances usability by capturing a broad range of voice problems in a single metric, making it an efficient and attractive option for voice clinics. Focusing on the AVQI, it was developed as a regression model to predict clinical perceptual ratings of overall voice quality [33]. AVQI has been implemented across multiple languages, with studies reporting single-value thresholds that support its clinical utility in voice quality assessment and its ability to differentiate voices with and without voice disorders [34–38].

Another multiparametric approach gaining attention, alongside traditional indexes, is artificial intelligence and machine learning (ML), which is inherently a multiparametric approach, offering a powerful and innovative method for enhancing voice assessment [39, 40]. Furthermore, in contrast to traditional regression models, ML approaches can capture nonlinear relationships between parameters, enabling more complex feature interactions. Various ML models have been applied for classifying healthy and disordered voices, such as artificial neural networks, Naïve Bayes, decision trees (DTs), support vector machines (SVMs), random forests (RFs), Gaussian mixture models, extreme learning machines, and convolutional neural networks, showing potential in voice disorder classification across multiple studies [39, 41–45]. Despite the progress that has been made by these advanced artificial intelligence-driven approaches, ML models for voice disorder detection still face limitations. Many algorithms struggle with classification accuracy, slow processing, and high resource demands [46]. Additionally, some models require full retraining when new data are introduced, causing delays. A major challenge is that many models rely on small datasets, which can lead to reduced generalizability and limited robustness in real-world clinical applications [41, 47].

Overall, while ML models offer advanced capabilities, they inherently come with challenges [48]. In contrast, acoustic measures like CPPs are accessible and require minimal resources. Indexes like AVQI sit between the broad complexity of ML and a single measure like CPPs in terms of accessibility and ease of use. This raises a question: Can existing and accessible acoustic metrics collected in a clinical setting provide adequate voice assessment support and disorder discrimination compared to more advanced and sophisticated methods like ML models? To address this question, this study aimed to compare the effectiveness of three levels of analysis – CPPs, the more complex AVQI, and various ML models – in identifying American English speakers with voice disorders. Among AVQI versions, AVQI-3 was selected as it is the most updated and validated version, showing a strong correlation with perceptual ratings of overall voice quality severity [49]. Unlike other acoustic-based indices such as ABI, which primarily assess breathiness, AVQI-3 provides a more comprehensive assessment of overall voice quality, making it a practical choice for quantifying voice abnormalities. Based on this, we hypothesize that AVQI-3, by capturing multiple acoustic features relevant to voice quality, will demonstrate superior discrimination of voice disorders compared to single-feature measures like CPPs, while also serving as an interpretable alternative to ML models.

Methods

This retrospective study of existing patient and healthy adult voice recordings was conducted to compare the performance of CPPs, AVQI, and ML models in screening clinically diagnosed voice disorders. This methodology builds on a prior study that focused on validating AVQI scores with perceptual assessment of voice quality in the American English language [49]. The current protocol examined audio recordings, including both sustained vowels and running speech samples, from vocally healthy individuals and patients presenting with various voice problems. Temporal and spectral acoustic measurements were extracted from the voice signals to calculate the AVQI score and CPPs. These metrics were also used as input features to train multiple ML models. A comparison was then made between the classification accuracy of AVQI and CPPs versus the ML models in detecting voice disorders using the same voice samples. A detailed description of each step in the methodology is outlined below.

Data Collection

The present study considered a retrospective cohort of 187 participants (native speakers of American English): 138 patients with voice disorders (95 females and 43 males) and 49 individuals without voice problems (29 females and 20 males). The patients were diagnosed with various voice disorders including atrophy, nodules, scars, Reinke’s edema, laryngeal dystonia, paralysis, muscle tension dysphonia, and other voice disorders. The healthy individuals reported no history of vocal problems or voice disorders and were carefully chosen to match the age of the recruited patients. All participants performed two voicing tasks: phonating the [a:] vowel and reading the first three sentences of the “Rainbow Passage” [50]. Audio data were collected in a sound booth using a head-mounted microphone, connected to an audio interface, and recorded at 44.1 kHz as uncompressed WAV files. The diagnoses and perceptual voice severity of all voice samples in this study were previously described in Castillo-Allendes et al. [49], in which a panel of six voice specialists evaluated each recording and rated the severity of each subject’s voice. Further details, including data collection, recording equipment, and demographic information of the subjects, are also available in [49].

Acoustic Measures: CPPs and AVQI-3

CPPs and the AVQI-3 index were generated [49, 51], using a 3-s voice segment from the Rainbow Passage (first 22 syllables), concatenated with a 3-s sustained vowel [a:] [52]. Six acoustic measurements, including CPPs, were obtained from this concatenated speech sample and used to compute the AVQI-3 metric using the equation introduced by Barsties and Maryn [52].

AVQI-3 = [4.152 − (0.177 × CPPs) − (0.006 × HNR) − (0.037 × Shim) + (0.941 × ShdB) + (0.01 × Slope) + (0.093 × Tilt)] × 2.8902

In the AVQI-3 regression equation, Shim and ShdB refer to local shimmer (in %) and shimmer in dB, HNR is the harmonics-to-noise ratio, Slope indicates the spectrum’s overall slope, and Tilt corresponds to the tilt of the regression line fitted to the spectrum. The resulting AVQI-3 values typically range from 0 to 10, with higher values indicating a greater degree of dysphonia.
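For illustration, the regression above can be written as a short function; this is a minimal sketch (not part of the original analysis), and the function name and the example input values, taken from the healthy-group means in Table 1, are ours and are shown only to make the arithmetic concrete.

```python
# Illustrative sketch of the AVQI-3 regression equation above; the inputs are the
# six acoustic measures obtained for the concatenated sample (e.g., from VOXplot).
def avqi3(cpps, hnr, shim, shdb, slope, tilt):
    """CPPs, HNR, and ShdB in dB; Shim (local shimmer) in %;
    Slope and Tilt as reported by the analysis software."""
    return (4.152
            - 0.177 * cpps
            - 0.006 * hnr
            - 0.037 * shim
            + 0.941 * shdb
            + 0.01 * slope
            + 0.093 * tilt) * 2.8902

# Example with the healthy-group mean values from Table 1 (illustrative only).
print(round(avqi3(cpps=15.39, hnr=19.09, shim=4.82,
                  shdb=0.46, slope=-17.57, tilt=-11.95), 2))
```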

The six acoustic measures contributing to the calculation of the AVQI-3 were obtained from VOXplot version 2.0.0 (Lingphon, Straubenhardt, Germany), a free software that integrates Praat scripts. An example of the output is shown in Figure 1, which illustrates the visualization and analysis of sustained vowel, voiced connected speech, and a concatenated sample. The output highlights the relevant metrics used in this analysis, with the six measures in dark red and the associated AVQI-3 score in bright red. After obtaining AVQI-3 for each subject, an optimal threshold was then defined for the AVQI-3 to maximize discrimination between healthy and dysphonic voices using different performance metrics, which are discussed in the “Evaluation Metrics” section.

Fig. 1.

VOXplot visualization showing the sustained vowel [a:] and continuous speech as input to the software (left panel). The right panel shows the outcome analysis with the resulting acoustic voice measurements. SV refers to a sustained vowel, CS to voiced connected speech, and MX to a concatenated sample combining both. The red stars and the highlighted boxes indicate the acoustic features (from the concatenated samples) considered in the present article.

ML Models

The six acoustic measurements that make up the AVQI index were also used as acoustic features to build ML models. These models extend beyond traditional linear regression (which was used in developing AVQI) by leveraging complex patterns and interactions between these acoustic features, thereby potentially improving diagnostic accuracy. For that purpose, the six features were normalized using the min-max normalization method to scale all acoustic feature values within the range [0, 1], enhancing the efficacy of the ML models. The classification accuracy of the AVQI-3 was compared against the accuracy of different supervised ML classifiers. Four popular ML models for data classification were considered: k-nearest neighbors (k-NN), SVM, DT, and RF. All models were trained as binary classifiers to discriminate between clinically diagnosed voice disorder audio samples and healthy audio samples.

To build, train, and evaluate each ML algorithm as a binary classifier on the present voice data, the Python programming language (version 3.12.2) and the scikit-learn library (version 1.5.0) were used for implementation. The default values were considered for the hyperparameters associated with each model [53]. After building the four classifiers, the six normalized acoustic features related to the AVQI-3 index of each subject created a dataset to train and test the ML models. The formed dataset consisted of the six features and the corresponding labels, where 1 represented voice disorder samples and 0 represented healthy samples. The dataset was divided into two subsets: a training subset comprising 80% of the voice samples (149 samples) and a testing subset comprising 20% of the voice samples (38 samples). The dataset was randomly shuffled and stratified. This ensured that the training and testing datasets maintained the same class distribution as the original dataset. It allowed the minority class (healthy samples) to be present in both the training and test sets and helped improve overall model evaluation accuracy. Using the acoustic measures from the training subset, the classifiers were trained to distinguish between two classes: voice disorder and healthy samples. All the trained models were then tested on the test subset to evaluate their classification performance. The performance test was carried out based on several evaluation metrics for a proper comparison among the models and between the models and both AVQI-3 and CPPs. Each evaluation metric is discussed in detail in the following subsection.
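A minimal sketch of this pipeline is shown below. It assumes the six normalized features are stored in an array X (shape: n_samples × 6) and the labels in y (1 = voice disorder, 0 = healthy); variable names and the random seed are illustrative, and scikit-learn defaults are kept for all hyperparameters, as in the study.

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Min-max normalization of the six AVQI-3 features to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# 80/20 split, shuffled and stratified so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)

# The four classifiers, with scikit-learn default hyperparameters.
models = {
    "RF": RandomForestClassifier(random_state=42),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:",
          round(accuracy_score(y_test, model.predict(X_test)), 2))
```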

Evaluation Metrics

A comparison of ML with both AVQI-3 and CPPs was conducted to evaluate their effectiveness in classifying dysphonic versus healthy voices. Different evaluation metrics were used, including accuracy, sensitivity, specificity, Youden Index, likelihood ratios (LR), receiver operating characteristic curve (ROC), and area under ROC curve (AUC). Below is a summary of the definition and the computation of each evaluation metric.

Accuracy represents the ratio of the correct model classifications to the total number of model classifications. The following equation was used to calculate the accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

TP (true positive) refers to the correctly classified dysphonic samples. FP (false positive) refers to healthy samples that were misclassified as dysphonic. TN (true negative) refers to the correctly classified healthy samples. FN (false negative) refers to dysphonic samples that were misclassified as healthy.

Sensitivity measures the effectiveness of the model or the index to correctly predict dysphonic audio samples (or positive cases). Sensitivity was computed based on the values of TP and FN using the following mathematical formula:

Sensitivity = TP / (TP + FN)

Specificity is similar to sensitivity, but it quantifies the ability of the model to correctly predict healthy audio samples (or the negative cases). This metric depends on TN and FP using the following mathematical formula:

Specificity = TN / (TN + FP)

Youden Index (sensitivity + specificity − 1) was used to account for both sensitivity and specificity. In addition to the Youden Index, the LR was calculated, which also accounts for both sensitivity and specificity but is less affected by the imbalance in sample size between healthy subjects and patients. We computed the LR for positive (LR+) and negative (LR−) results. LR+ is the ratio of the probability of a positive result in dysphonic voices to that in healthy voices: sensitivity/(1 − specificity). LR− is the ratio of the probability of a negative result in dysphonic voices to that in healthy voices: (1 − sensitivity)/specificity. ROC is another performance metric that measures the accuracy of a classification model; it is a plot of sensitivity against 1 − specificity. The area under this curve (AUC) indicates the accuracy of the classifier: the larger the AUC, the greater the classification accuracy of the model. That is, a perfect classifier would yield an AUC of 1.
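These definitions can be computed directly from a confusion matrix. The sketch below is illustrative only, assuming y_true and y_pred are 0/1 arrays with 1 denoting the voice-disordered class.

```python
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    # Counts of true negatives, false positives, false negatives, true positives.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
        "LR+": sensitivity / (1 - specificity) if specificity < 1 else float("inf"),
        "LR-": (1 - sensitivity) / specificity if specificity > 0 else float("inf"),
    }
```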

By analyzing the ROC curve, the optimal cutoff scores associated with AVQI-3 and CPPs were determined based on the maximum Youden Index. To evaluate the clinical applicability of these cutoffs, a balance between LR+ and LR− is important. High diagnostic performance is indicated by an LR+ greater than 10 and an LR− less than 0.1 [38]. In addition to the quantitative evaluation scores, the ROC plot was also used for visual comparison.
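As an illustration of this cutoff selection (a sketch under our own assumptions, not the authors' exact script), the ROC curve and the Youden-maximizing threshold can be obtained as follows, where scores is assumed to hold AVQI-3 values and labels the 0/1 diagnoses; because lower CPPs values indicate dysphonia, the CPPs scores would be negated before applying the same procedure.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# scores: continuous measure that increases with dysphonia (e.g., AVQI-3).
# labels: 1 = clinically diagnosed voice disorder, 0 = vocally healthy.
fpr, tpr, thresholds = roc_curve(labels, scores)
youden = tpr - fpr                      # equals sensitivity + specificity - 1
best = np.argmax(youden)

print("AUC:", round(auc(fpr, tpr), 2))
print("cutoff:", thresholds[best],
      "sensitivity:", round(tpr[best], 2),
      "specificity:", round(1 - fpr[best], 2))
```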

Results

The results for the six acoustic measurements, including CPPs, and the AVQI-3 metric are presented in Table 1 for both the healthy and voice-disordered groups. The mean, standard deviation (SD), minimum value, and maximum value are listed for each measure. In general, the voice-disordered group shows greater variability and wider ranges between the minimum and maximum values than the healthy control group in most of the acoustic measurements. For example, the slope has larger negative values in the voice-disordered group, with a mean of −21.32 compared to −17.57 in the normal group. The tilt measure does not demonstrate a noticeable difference between the two groups across all statistical metrics. The mean CPPs value is approximately 16.5% lower in the voice-disordered group (12.84 dB) than in the normal group (15.39 dB). The minimum CPPs value in the voice-disordered group (4.88 dB) is notably lower than in the normal group (9.88 dB). Shimmer measurements, both in percentage and dB, are relatively elevated in the voice-disordered group: Shim increases by 23.7% (5.96% vs. 4.82%) and ShdB is higher by 21.7% (0.56 dB vs. 0.46 dB). The mean HNR is marginally lower in the voice-disordered group, yet the range between the minimum and maximum values is markedly larger than in the normal control. Finally, the AVQI-3 metric demonstrates a higher value in the voice-disordered group, with an average score of 2.16 compared to 0.83 in the normal group, and a wider range (−0.91 to 8.79) compared to the normal group (−2.42 to 4.74). Across all measurements, SD values are higher in the voice-disordered group, highlighting the greater variability across patients.

Table 1.

Descriptive statistics of the six acoustic measurements used for AVQI-3 computation for normal and voice-disordered groups

Measurement | Healthy mean | Healthy SD | Healthy min | Healthy max | Disordered mean | Disordered SD | Disordered min | Disordered max
Slope | −17.57 | 3.92 | −25.66 | −8.35 | −21.32 | 4.44 | −33.17 | −7.68
Tilt | −11.95 | 1.01 | −13.86 | −8.97 | −12.01 | 1.06 | −14.04 | −9.08
CPPs (dB) | 15.39 | 2.16 | 9.88 | 22.47 | 12.84 | 2.81 | 4.88 | 17.58
Shimmer (%) | 4.82 | 2.05 | 1.95 | 12.02 | 5.96 | 2.97 | 1.87 | 20.24
Shimmer dB | 0.46 | 0.18 | 0.22 | 1.05 | 0.56 | 0.27 | 0.26 | 1.80
HNR (dB) | 19.09 | 3.69 | 10.24 | 26.07 | 18.67 | 5.10 | 4.60 | 27.76
AVQI-3 | 0.83 | 1.40 | −2.42 | 4.74 | 2.16 | 2.00 | −0.91 | 8.79

The classification performance of each isolated acoustic parameter (Slope, Tilt, CPPs, Shimmer, Shimmer dB, and HNR) used to compute AVQI, along with AVQI itself, is presented in Figure 2. The ROC curves and AUC values are shown in the figure for visual and quantitative comparison. The x-axis represents 1 − specificity, and the y-axis denotes sensitivity. Each line corresponds to a different acoustic metric, with its AUC value listed in the legend. Visual inspection shows that the curves closer to the top-left corner, particularly CPPs (in orange) and Slope (in green), indicate better sensitivity and specificity in distinguishing between healthy and voice-disordered samples. This observation is confirmed by the AUC values, where CPPs (AUC = 0.77) and Slope (AUC = 0.73) achieve the highest values. The AVQI curve in blue follows a lower trajectory with a smaller AUC (0.70) than the CPPs and Slope metrics. In contrast, metrics such as HNR (AUC = 0.49) and Tilt (AUC = 0.47) display the weakest classification power; their curves lie very close to the diagonal, indicating near-chance performance. Shimmer and Shimmer dB exhibit intermediate performance with AUCs of 0.62 and 0.60, respectively. Since CPPs demonstrated the strongest classification performance, it was selected for further analysis and comparison with AVQI-3 against the ML models.
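A single-feature comparison of this kind can be reproduced with a few lines of scikit-learn. The sketch below is illustrative and assumes a pandas DataFrame df with one column per measure and a 0/1 label column (1 = voice disorder), neither of which is part of the original materials; the sign flip keeps the AUC orientation-independent for measures such as CPPs, where lower values indicate dysphonia.

```python
from sklearn.metrics import roc_auc_score

features = ["Slope", "Tilt", "CPPs", "Shimmer", "Shimmer dB", "HNR", "AVQI-3"]
for feature in features:
    auc_value = roc_auc_score(df["label"], df[feature])
    auc_value = max(auc_value, 1 - auc_value)  # orientation-independent AUC
    print(feature, round(auc_value, 2))
```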

Fig. 2.

Receiver operating characteristic (ROC) curves with the area under the curve (AUC) for the Acoustic Voice Quality Index (AVQI) and its six acoustic measurements: Slope, Tilt, Smoothed Cepstral Peak Prominence (CPPs), Shimmer, Shimmer dB, and Harmonics-to-Noise Ratio (HNR).

The discriminatory behavior of AVQI-3 and CPPs in classifying voice samples into healthy and voice-disordered samples is displayed in Figure 3. Along with these two measures, the figure illustrates the ML performance on the testing subset for each of the four classifiers, which were trained on the six acoustic measurements related to the AVQI-3 calculation. Each model and acoustic metric has a unique color to facilitate the comparison. The closer the ROC curve is to the top-left corner, the better the model’s performance in distinguishing between the two classes. The figure shows that most ML classifiers, with steeper and higher curves and larger AUC values, outperform AVQI-3, which follows a lower trajectory with lower sensitivity/specificity and the lowest AUC (0.70). CPPs exhibited a relatively higher AUC (0.77) and performed better than both AVQI-3 and the DT model, but it still did not match the performance of the other models. Among the ML algorithms, the ROC curves display some differences in performance. As can be seen in the figure, most of the models (particularly k-NN, SVM, and RF) demonstrated high sensitivity and specificity, with closely overlapping trajectories and larger areas under the curves. The DT classifier, however, has the weakest discriminatory performance among the models, in terms of both visual inspection and the area under its curve, which appears smaller than those of the best-performing classifiers. Interestingly, the DT model exhibits classification behavior very close to the AVQI-3 metric (AUC of 0.71 vs. 0.70, respectively).

Fig. 3.

Receiver operating characteristic (ROC) curves with the area under the curve for the Acoustic Voice Quality Index (AVQI) and Smoothed Cepstral Peak Prominence (CPPs), compared with four machine learning classifiers – k-nearest neighbors (k-NN), support vector machine (SVM), decision tree (DT), and random forest (RF) – in classifying healthy versus voice-disordered samples.

To determine the classification accuracies associated with CPPs and AVQI-3, the ROC analysis identified cutoff scores that maximized the Youden Index, achieving a balance between sensitivity and specificity. The positive class was the voice disorders group. The optimal thresholds were 1.54 for the AVQI-3 index and 14.35 dB for CPPs. Figures 4 and 5 illustrate the differentiation between vocally healthy individuals and those with voice disorders based on CPPs and AVQI-3 values. The histogram distributions demonstrate how the thresholds separate the two groups with varying effectiveness, reflecting their diagnostic relevance. Visual inspection suggests that CPPs provides better discrimination between healthy individuals and patients than AVQI-3, as indicated by the relatively more distinct group distributions around the defined thresholds (vertical dashed lines).

Fig. 4.

Histograms of Smoothed Cepstral Peak Prominence (CPPs) for subjects with voice disorders (orange) and vocally healthy individuals (blue), based on a concatenation of sustained vowels and voiced continuous speech. The y-axis represents the proportion of observations normalized to sum to 1. The vertical black dashed line denotes the CPPs threshold determined using the maximum Youden Index.

Fig. 5.

Histograms of AVQI version 03.01 (AVQI-3) for subjects with voice disorders (orange) and vocally healthy individuals (blue), based on a concatenation of sustained vowels and voiced continuous speech. The y-axis represents the proportion of observations normalized to sum to 1. The vertical black dashed line denotes the AVQI-3 threshold determined using the maximum Youden Index.

Table 2 presents the performance metrics comparing the four ML models with both the AVQI-3 index and CPPs using the identified cutoff scores. Comparing the two acoustic measures, CPPs exceeded AVQI-3 in most classification performance metrics, with higher accuracy (0.68 vs. 0.61), higher sensitivity (0.65 vs. 0.55), and a higher Youden Index (0.43 vs. 0.35). Interestingly, AVQI-3 had the highest specificity (0.8), even against all ML models, but at the expense of the lowest sensitivity (0.55). The likelihood ratios further highlighted differences in predictive performance: CPPs and AVQI-3 had the highest LR+ values (2.91 and 2.7, respectively), while RF yielded the lowest LR− (0.12).

Table 2.

Evaluation scores for the ML models, AVQI-3 index (threshold at 1.54), and CPPs (threshold at 14.35 dB), including accuracy, sensitivity, specificity, Youden Index, and positive/negative likelihood ratios (LR+ and LR−)

Model/measure | Accuracy | Sensitivity | Specificity | Youden Index | LR+ | LR−
RF | 0.84 | 0.93 | 0.6 | 0.53 | 2.32 | 0.12
k-NN | 0.82 | 0.89 | 0.6 | 0.49 | 2.23 | 0.18
SVM | 0.79 | 0.93 | 0.4 | 0.33 | 1.55 | 0.18
DT | 0.76 | 0.82 | 0.6 | 0.42 | 2.05 | 0.3
CPPs | 0.68 | 0.65 | 0.78 | 0.43 | 2.91 | 0.45
AVQI-3 | 0.61 | 0.55 | 0.8 | 0.35 | 2.7 | 0.56

k-NN, k-nearest neighbors; SVM, support vector machine; DT, decision tree; RF, random forest; AVQI, Acoustic Voice Quality Index; CPPs, Smoothed Cepstral Peak Prominence.

Based on the table, the RF model demonstrated the highest overall accuracy (0.84) and sensitivity (0.93), outperforming all other models and the two acoustic measures. k-NN also exhibited accurate classification, achieving an accuracy of 0.82 and sensitivity of 0.89; however, both RF and k-NN shared a relatively low specificity of 0.6. SVM showed a high sensitivity of 0.93, similar to RF; however, its specificity was the lowest at 0.4, bringing its Youden Index down to just 0.33, the lowest among the four models. In contrast, RF and k-NN revealed the highest Youden Index values of 0.53 and 0.49, respectively. DT displayed a specificity of 0.6 and a sensitivity of 0.82, yielding a Youden Index of 0.42 and the lowest accuracy (0.76) among the ML models.

Discussion

AVQI as a Voice Diagnostic Tool

AVQI was developed as a regression model to predict clinical perceptual ratings of overall voice quality [33]. It is characterized by its simplicity, speed, and accessibility, making it particularly advantageous in clinical settings with limited resources, such as time or budget. AVQI serves as a quick and reliable voice assessment tool, especially when used with open-access software like VOXplot, which provides a user-friendly and efficient way to obtain AVQI scores, even during sample collection. These qualities contribute to AVQI’s suitability for environments where practicality and ease of use are priorities. Due to these advantages, AVQI has attracted significant attention from scholars, with numerous studies evaluating its effectiveness. Previous research has consistently demonstrated AVQI’s strong ability to accurately align with perceptual judgments of voice quality across various languages [3437].

Given AVQI’s strong potential for assessing voice quality, we hypothesized that it could effectively screen clinically diagnosed voice disorders with a performance comparable to advanced techniques like ML, which are commonly used in the literature for such screening tasks. To test this hypothesis, we evaluated the AVQI-3 index’s accuracy in identifying voice disorders in American English speakers. The analysis identified 1.54 as the optimal cutoff score for the AVQI-3 index in detecting voice disorders from audio recordings. This cutoff differs from findings in related studies. For instance, Barsties V. Latoszek et al. [38] reported a higher AVQI-3 cutoff of 3.31, but their study focused on Lithuanian speakers rather than English speakers, as in our research. Given that AVQI-3 uses connected speech samples, variations in cutoff scores across languages are expected. The literature confirms these differences found in the AVQI-3’s cutoff scores: 1.41 in Japanese, 2.43 in Dutch, and 2.28 in Spanish [34, 35, 51]. Only one study with English speakers found a close cutoff at 1.17 [49], but it used AVQI-3 as a voice quality assessment tool compared to perceptual ratings by clinicians, rather than for diagnosing voice disorders, which could explain the slight variation in the cutoff score.

The current study found that AVQI-3 achieved high specificity (0.8), outperforming all other ML models; however, it had the lowest sensitivity (0.55). Consequently, AVQI-3 did not achieve overall high classification performance compared to other methods, falling short of meeting clinical thresholds for strong diagnostic performance (LR+ >10 and LR− <0.1) [54]. Our findings align with the study by Barsties V. Latoszek et al. [38] on Lithuanian speakers, which similarly showed that AVQI-3 did not meet these recommended thresholds for diagnostic use. Interestingly, a previous study on the same subjects demonstrated that AVQI-3 achieved strong scores and clinical applicability when used to evaluate voice quality against perceptual judgments [49]. This could be because AVQI-3 was originally developed to predict clinicians’ perceptual ratings of overall voice quality. Other indices that incorporate parameters assessing vocal performance and functioning, rather than focusing solely on voice quality, may be more appropriate for screening voice disorders. For instance, the DSI, which has been clinically used to evaluate the outcomes of phonosurgery and voice therapy, combines both vocal functioning metrics (such as maximum phonation time) and voice quality metrics (such as jitter) [30]. Therefore, AVQI-3 appears to be more suited for evaluating voice quality rather than reliably detecting voice pathology.

CPPs as a Voice Diagnostic Tool

While CPPs has been widely used for distinguishing dysphonic from non-dysphonic voices due to its strong correlation with voice quality [22–27], this study aimed to compare the performance of a single acoustic measure (CPPs), a composite index (AVQI-3) that integrates multiple acoustic features, and ML models as a more complex multiparametric approach. This comparison allows for a better understanding of how each method performs in identifying dysphonic voices and whether incorporating multiple parameters provides an advantage over a single measure. Although CPPs is a component of AVQI-3, its individual evaluation remains valuable for assessing the contribution of a single well-established acoustic measure to voice disorder classification. The present study revealed that CPPs outperformed AVQI-3 as a standalone feature in identifying clinically diagnosed voice disorders. The optimal cutoff for CPPs was determined to be 14.35 dB, achieving a balanced sensitivity of 0.65 and specificity of 0.78. When comparing this threshold to previous studies involving English speakers and using the same software (Praat), this result was nearly identical to the 14.45 dB cutoff reported by Murton et al. [15] for the sustained vowel /a:/ task. In contrast, in Murton et al.’s study, the threshold for the running speech task, the Rainbow Passage, was lower at 9.33 dB. Hence, despite calculating CPPs by concatenating voiced segments from both running speech and sustained vowels, the present CPPs cutoff aligns more closely with the vowel-based threshold found by Murton et al. [15]. This implies that concatenating these segments may not substantially enhance CPPs’ performance, as the cutoff remains similar to that derived from vowels alone. Further investigation is needed to explore the impact of different concatenation strategies on CPPs’ efficacy and whether alternative approaches could yield improved classification outcomes.

Moreover, CPPs outperformed AVQI-3 and the DT classifier, suggesting that it more effectively captures essential aspects of voice quality closely tied to vocal pathology. Considering simpler measures like CPPs could provide practical, accessible, and accurate solutions for dysphonia screening, especially since CPPs can be easily obtained within a short period of time using tools like Praat, a free and easy-to-use software. The potential of CPPs in screening vocal pathology, as demonstrated in this study and supported by prior research, suggests a promising direction for future work. Other multiparametric indices based on spectral and cepstral analysis, such as the CSID [22, 23], could be further explored and compared with ML models, as they may yield comparable effectiveness in dysphonia screening.

ML as a Voice Diagnostic Tool

The present study highlighted the superior performance of ML models, particularly RF and k-NN, in accurately identifying voice disorders compared to traditional acoustic measures like AVQI-3 and CPPs. The high sensitivity of RF and SVM suggests these models are especially effective in detecting cases of voice pathology. The superior ML performance over AVQI-3 is due to their ability to nonlinearly combine features and identify patterns not apparent when using simple linear combinations as in AVQI-3. These findings align with prior research showing that more complex ML kernels (progressing from linear to cubic polynomial) improve classification accuracy when using speech and voice features [55]. These nonlinear capabilities allow ML models to deliver enhanced accuracy in identifying voice disorders by leveraging advanced algorithms.

While the ML models demonstrated relatively high sensitivity and overall accuracy, their specificity remained a significant challenge. Consequently, none of the models met the recommended criteria for acceptable diagnostic performance, with LR+ scores below 10 and LR− scores above 0.1 [54]. A potential explanation for the poor specificity observed could be the imbalance in the number of healthy voice samples and voice disorder samples (49 vs. 138 samples).

Since this study did not include hyperparameter optimization, the incorporation of additional features, or the application of feature selection techniques, the current ML accuracies represent baseline performance. This highlights significant potential for improvement, which remains unexplored in this work. Thus, even with these minimal/baseline ML accuracies, all models outperformed the AVQI index – emphasizing the promising ML performance in screening voice disorders. However, the primary objective of this study was to assess default, commonly used ML models built on AVQI-3’s six acoustic features, ensuring a fair comparison with the AVQI-3 index by treating AVQI-3 as a linear regression model as it was originally developed. Although ML outperformed AVQI-3, ML models also present notable challenges. High computational resources and expertise are required to develop and utilize these models effectively. Additionally, the lack of user-friendly tools, such as accessible apps or software designed for clinical practice, creates a significant gap in practical implementation. At present, ML models are not easily incorporated into routine clinical workflows, posing a barrier to their broader adoption in voice clinics. On the other hand, AVQI-3, though it struggled with lower sensitivity, still achieved the highest specificity among all ML models despite the dataset imbalance. This highlights the inherent trade-off between accurately identifying true cases and minimizing false positives. Overall, the results indicate that ML models hold promise for comprehensive voice disorder screening in the future; meanwhile, clinical applications must carefully balance sensitivity and specificity based on diagnostic priorities when considering AVQI-3.

Clinical Impact

AVQI-3, CPPs, and ML all have the potential to provide accessible methods to screen for voice disorders. For example, using smartphones, people could regularly assess their voices, detect voice problems early, and seek timely medical attention. None of CPPs, AVQI, or ML will likely be able to differentiate between different pathologies (e.g., polyp, cyst, dysplasia, or malignancy). However, these measures could serve as an initial screening tool and prompt individuals to seek evaluation from a laryngologist or speech-language pathologist for a videostroboscopy examination. Early diagnosis and treatment would likely lead to a better outcome. Ideally, individuals would obtain these measures initially when the voice is normal since changes in these measures may be more indicative of pathology.

Having a voice disorder does not necessarily result in disrupted acoustic signals or altered voice quality, particularly in the modal voice register, which refers to the normal range of voice production used in everyday speech and conversations. This may explain why AVQI-3, initially developed to evaluate voice quality, did not achieve acceptable accuracy scores when detecting clinically diagnosed patients. Superficial vocal fold scar, mild inflammation, or small mass lesions of the vocal folds may not always impair vibration significantly enough to affect voice quality, except in the higher vocal range, when the cover of the fold is under greater tension. As such, using the AVQI-3 index, or even clinical perceptual assessments, may be insufficient to identify pathology unless protocols are developed to explore the vocal range [38]. On the other hand, nonorganic dysphonia can impact voice quality despite normal vocal fold structure and mobility [23, 56, 57]. Therefore, AVQI-3 may help identify and monitor voice pathologies affecting voice quality, which warrants further research.

An additional clinical application could involve CPPs, which demonstrated superior performance compared to AVQI-3 and achieved a Youden Index comparable to most ML models. Establishing a cutoff score for CPPs, aligned with previous research [15], may enable it to serve as both a screening tool for detecting voice disorders and a metric for tracking treatment outcomes. Because CPPs does not rely on fundamental frequency estimation – challenging to compute in cases of severely impaired voice quality – and instead assesses overall noise intensity in the signal, it is well-suited for a broad range of voice disorders. Furthermore, CPPs can be obtained easily using free software across various tasks (e.g., sustained vowels, running speech, or concatenated samples, as shown in this study), making it accessible and convenient for healthcare providers in clinical settings. Additionally, CPPs offers more straightforward interpretation for clinicians, providing meaningful insights into voice assessment without the complexity of ML models, which are hard to interpret.

ML models demonstrate high accuracy and have the potential to become valuable tools in clinical environments, showing strong capabilities in voice disorder screening both in this study and prior research. With the required infrastructure and expertise, ML could be highly effective in voice clinics and research settings. However, their integration into clinical practice poses challenges due to the need for computational resources and specialized skills. For example, complex ML models often have large parameter sizes to improve generalizability, but this increases computational demands. As a result, they require high-performance hardware and have long inference times, limiting real-time clinical use. Additionally, high computational costs and reliance on external resources further hinder their practical implementation. These challenges make simpler tools like AVQI more practical for initial screenings in resource-limited clinical environments (e.g., rural clinics or low-income hospitals with insufficient diagnostic tools and computational capacity). Ensuring user-friendly implementation, such as automated workflows and integration into existing clinical systems, is crucial for improving usability and facilitating adoption. Future studies should explore ways to leverage ML to enhance clinicians’ and researchers’ understanding of the physiological aspects of voice disorders, moving beyond “black box” models and providing insights that complement, rather than replace, clinical expertise. Interpretability is key, enabling ML models to link outcomes to vocal pathology and support optimal patient care.

Overall, this study highlights a clear trade-off between accessibility and accuracy when comparing AVQI, CPPs, and ML models. While AVQI and CPPs offer simplicity and ease of use, making them well-suited for general clinical settings with limited resources, ML models represent a promising approach for more comprehensive screening, especially in research settings or advanced facilities. In clinics with limited equipment, AVQI and CPPs may be sufficient for basic screenings, whereas ML models, with their higher accuracy, could provide added value in specialized cases.

Limitations and Future Directions

The present study has several limitations that can guide future research directions. First, this study focuses on comparing CPPs and AVQI-3 with ML models specifically for American English speakers only. Considering only the English language may limit the generalizability of our findings due to linguistic variations across different languages. Since connected speech was used to extract the acoustic parameters of interest and train the models, applying these findings to other languages should be done with caution. For example, the optimal cutoff score for AVQI-3 in distinguishing healthy from disordered voices may differ across languages. Compared to our identified threshold of 1.54 in English speakers, previous studies have demonstrated variability across languages reporting AVQI-3 cutoff scores of 3.31 in Lithuanian [38], 1.41 in Japanese, 2.43 in Dutch, and 2.28 in Spanish speakers [34, 35, 51]. Therefore, expanding the scope to include other languages and more diverse populations would provide valuable insights into the generalizability and versatility of these tools. Also, although commonly used voice quality measures like AVQI-3 show promise in screening for voice disorders, the present approach could be repeated in different ways with various other multiparametric indices and single voice parameters that exist and will exist in the future. For example, future research could investigate measures that assess vocal function, such as the DSI, or cepstral analysis-based indices like the CSID, which may be more suitable for screening voice disorders. These avenues offer potential for improving diagnostic accuracy and applicability in diverse clinical and research contexts. Another potential area for future exploration lies in the choice of speech tasks used in this study. The current approach utilized a specialized task, combining sustained vowels ([a:]) with a portion of the Rainbow Passage, to obtain AVQI-3 and CPPs. Future research could investigate alternative speech tasks, such as different sustained vowels or running speech tasks. These variations may enhance the performance of both metrics and ML models by better capturing abnormalities in voice disorders.

An imbalance due to the low number of healthy subjects compared to patients is a limitation of this study, potentially leading to overfitting to disordered voices and reducing the models’ ability to generalize to healthy populations. This imbalance may result in higher false-positive rates, where healthy individuals could be misclassified as having a voice disorder; it likely contributed to the models’ reduced specificity (with an average score of 55% across models) and limited their clinical reliability. The imbalance could also lead to an unrepresentative distribution of healthy and disordered voices in the training and testing sets. To address this, stratified data splitting was applied to preserve class proportions and prevent biased evaluation. However, while this approach ensures balanced representation, it does not mitigate the underlying class imbalance, which requires additional techniques such as data augmentation for healthy voice samples; two simple mitigation strategies are sketched below. Hence, a more balanced dataset in future studies could enhance ML robustness, improve ML classification accuracy, and provide a more comprehensive comparison with common acoustic measures such as AVQI-3 and CPPs. Furthermore, while the primary goal of this study was to compare baseline ML models with AVQI-3 and CPPs under fair conditions, there is room for improvement in ML performance. For instance, hyperparameter optimization, which was not conducted in this analysis, could yield better results. Further exploration of more advanced models with optimized hyperparameters could enhance their diagnostic accuracy and robustness. Future work should focus on hyperparameter tuning and model refinement to enhance clinical applicability. Additionally, exploring feature selection or incorporating additional acoustic voice features offers further opportunities to improve model accuracy. Another broader limitation of ML in detecting voice disorders from audio samples is its reliance on recording conditions, such as microphone type and environmental factors (e.g., background noise and room reverberation). These conditions can introduce variability in the extracted acoustic voice features, which may impact both model accuracy and generalizability, making this an important area for further research.
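For illustration only (neither approach was applied in this study), two common ways to mitigate such class imbalance with the models used here are class reweighting and random oversampling of the minority (healthy) class. In this sketch, X_train and y_train are assumed to be the training split described in the Methods, and the oversampler comes from the separate imbalanced-learn package.

```python
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler  # from the imbalanced-learn package

# Option 1: weight classes inversely to their frequency during training.
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_weighted.fit(X_train, y_train)

# Option 2: randomly duplicate minority-class (healthy) samples before fitting.
X_bal, y_bal = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
rf_oversampled = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
```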

Conclusion

This study evaluated the effectiveness of AVQI-3, CPPs, and various ML models in identifying voice disorders, highlighting their distinct strengths. Although AVQI-3 has demonstrated high efficacy as an objective acoustic metric for approximating voice quality, its application should be approached with caution beyond this scope. Specifically, it is not designed, nor should it be solely relied upon, as a tool for voice disorder screening. CPPs demonstrated reliable performance, even surpassing AVQI-3 and approaching the accuracy of ML models. ML models also outperformed AVQI-3 with high screening performance – even without extensive hyperparameter tuning while using only the AVQI-3’s six acoustic features as input. The success of the nonlinear ML models compared to classical linear regression highlights the nonlinear relationships between acoustic parameters in disordered voices, further supporting ML as an effective screening tool.

The study highlights a trade-off: AVQI-3, and especially CPPs, provide accessible, interpretable tools for voice clinics with limited resources while enabling continuous tracking of treatment progress. In contrast, ML models offer potential for advanced diagnostics – especially in settings with advanced computational resources and specialized expertise – and could be more widely used if their interpretability for clinical decisions improves.

Acknowledgments

We thank scholars worldwide for their valuable contributions to voice/speech science and vocal health.

Statement of Ethics

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Michigan State University (protocol code STUDY00006972, 2 March 2022). Written informed consent was obtained from all participants involved in the study.

Conflict of Interest Statement

The authors have no conflicts of interest to declare.

Funding Sources

This work was supported by the National Institute of Deafness and Other Communication Disorders of the National Institutes of Health (Award No. R01DC012315). The funder had no role in the design, data collection, data analysis, and reporting of this study. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author Contributions

Conceptualization: A.M.Y., A.C.-A., M.L.B., and E.J.H.; methodology: A.M.Y., A.C.-A., J.C., A.D.R., and E.J.H.; software: A.M.Y. and A.C.-A.; validation: A.C.-A., M.L.B., and E.J.H.; formal analysis and project administration: A.M.Y. and A.C.-A.; investigation: J.C., A.D.R., M.L.B., and E.J.H.; resources: J.C. and A.D.R.; data curation: A.C.-A.; writing – original draft preparation: A.M.Y., A.C.-A., and M.L.B.; writing – review and editing: A.M.Y., M.L.B., A.C.-A., E.J.H., and A.D.R.; visualization: A.M.Y. and M.L.B.; funding acquisition and supervision: E.J.H. All authors have read and agreed to the published version of the manuscript.

Funding Statement

This work was supported by the National Institute of Deafness and Other Communication Disorders of the National Institutes of Health (Award No. R01DC012315). The funder had no role in the design, data collection, data analysis, and reporting of this study. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Availability Statement

The data that support the findings of this study are not publicly available due to their containing information that could compromise the privacy of research participants but are available from the corresponding author, A.M.Y., upon reasonable request.

References

• 1. Bhattacharyya N. The prevalence of voice problems among adults in the United States. Laryngoscope. 2014;124(10):2359–62.
• 2. Morris MA, Meier SK, Griffin JM, Branda ME, Phelan SM. Prevalence and etiologies of adult communication disabilities in the United States: results from the 2012 National Health Interview Survey. Disabil Health J. 2016;9(1):140–4.
• 3. Jacobson BH, Johnson A, Grywalski C, Silbergleit A, Jacobson G, Benninger MS, et al. The voice handicap index (VHI): development and validation. Am J Speech Lang Pathol. 1997;6(3):66–70.
• 4. Kempster GB, Gerratt BR, Verdolini Abbott K, Barkmeier-Kraemer J, Hillman RE. Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol. Am J Speech Lang Pathol. 2009;18(2):124–32.
• 5. Nemr K, Simoes-Zenari M, Cordeiro GF, Tsuji D, Ogawa AI, Ubrig MT, et al. GRBAS and Cape-V scales: high reliability and consensus when applied at different times. J Voice. 2012;26(6):812.e17–22.
• 6. Alipour F, Finnegan E, Scherer R. Aerodynamic and acoustic effects of abrupt frequency changes in excised larynges. J Speech Lang Hear Res. 2009;52(2):465–81.
• 7. Mehta DD, Hillman RE. Voice assessment: updates on perceptual, acoustic, aerodynamic, and endoscopic imaging methods. Curr Opin Otolaryngol Head Neck Surg. 2008;16(3):211–5.
• 8. Beale T, Twigg VM, Horta M, Morley S. High-resolution laryngeal US: imaging technique, normal anatomy, and spectrum of disease. Radiographics. 2020;40(3):775–90.
• 9. Kosztyła-Hojna B, Duchnowska E, Zdrojkowski M, Łobaczuk-Sitnik A, Biszewska J. Application of High Speed Digital Imaging (HSDI) technique and voice acoustic analysis in the diagnosis of the clinical form of Presbyphonia in women. Otolaryngol Pol. 2020;74(5):24–30.
• 10. Yousef AM. Laryngeal mechanisms and vocal folds function in adductor laryngeal dystonia during connected speech [Ph.D. dissertation]. Michigan State University; 2023 [Internet]. [cited 2024 Jun 29]. Available from: http://login.proxy.lib.uiowa.edu/login?url=https://www.proquest.com/dissertations-theses/laryngeal-mechanisms-vocal-folds-function/docview/2835386468/se-2
• 11. Yousef AM, Deliyski DD, Zacharias SR, de Alarcon A, Orlikoff RF, Naghibolhosseini M. Spatial segmentation for laryngeal high-speed videoendoscopy in connected speech. J Voice. 2023;37(1):26–36.
• 12. Alegria R, Freitas SV, Manso MC. Is there an improvement on acoustic voice parameters in patients with bilateral vocal fold nodules after voice therapy? A meta-analysis. Eur Arch Otorhinolaryngol. 2020;277(8):2163–72.
• 13. Awan SN, Roy N, Jetté ME, Meltzner GS, Hillman RE. Quantifying dysphonia severity using a spectral/cepstral-based acoustic index: comparisons with auditory-perceptual judgements from the CAPE-V. Clin Linguist Phon. 2010;24(9):742–58.
• 14. Batthyany C, Latoszek BBV, Maryn Y. Meta-analysis on the validity of the acoustic voice quality index. J Voice [Internet]. 2022 [cited 2024 Nov 9]. Available from: https://www.sciencedirect.com/science/article/pii/S0892199722001321
• 15. Murton O, Hillman R, Mehta D. Cepstral peak prominence values for clinical voice evaluation. Am J Speech Lang Pathol. 2020;29(3):1596–607.
• 16. Patel RR, Awan SN, Barkmeier-Kraemer J, Courey M, Deliyski D, Eadie T, et al. Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function. Am J Speech Lang Pathol. 2018;27(3):887–905.
• 17. Baken RJ, Orlikoff RF. Clinical measurement of speech and voice. Singular Thomson Learning; 2000. p. 628.
• 18. Kopf LM, Jackson-Menaldi C, Rubin AD, Skeffington J, Hunter EJ, Skowronski MD, et al. Pitch strength as an outcome measure for treatment of dysphonia. J Voice. 2017;31(6):691–6.
• 19. Sun X, Xu Y. Perceived pitch of synthesized voice with alternate cycles. J Voice. 2002;16(4):443–59.
• 20. Hillenbrand J, Houde RA. Acoustic correlates of breathy vocal quality: dysphonic voices and continuous speech. J Speech Hear Res. 1996;39(2):311–21.
• 21. Watts CR, Awan SN. Use of spectral/cepstral analyses for differentiating normal from hypofunctional voices in sustained vowel and continuous speech contexts. J Speech Lang Hear Res. 2011;54(6):1525–37.
• 22. Awan SN, Roy N, Dromey C. Estimating dysphonia severity in continuous speech: application of a multi-parameter spectral/cepstral model. Clin Linguist Phon. 2009;23(11):825–41.
• 23. Awan SN, Roy N, Zhang D, Cohen SM. Validation of the Cepstral Spectral Index of Dysphonia (CSID) as a screening tool for voice disorders: development of clinical cutoff scores. J Voice. 2016;30(2):130–44.
• 24. Esen Aydinli F, Özcebe E, İncebay Ö. Use of cepstral analysis for differentiating dysphonic from normal voices in children. Int J Pediatr Otorhinolaryngol. 2019;116:107–13.
• 25. Delgado-Hernández J, León-Gómez N, Jiménez-Álvarez A. Diagnostic accuracy of the smoothed cepstral peak prominence (CPPS) in the detection of dysphonia in the Spanish language. Loquens. 2019;6(1):e058.
• 26. Fraile R, Godino-Llorente JI. Cepstral peak prominence: a comprehensive analysis. Biomed Signal Process Control. 2014;14:42–54.
• 27. Lee Y, Kim G, Kwon S. The usefulness of auditory perceptual assessment and acoustic analysis for classifying the voice severity. J Voice. 2020;34(6):884–93.
• 28. Van Leer E, Pfister RC, Zhou X. An iOS-based cepstral peak prominence application: feasibility for patient practice of resonant voice. J Voice. 2017;31(1):131.e9–16.
• 29. Cantor-Cutiva LC, Ramani SA, Walden PR, Hunter EJ. Screening of voice pathologies: identifying the predictive value of voice acoustic parameters for common voice pathologies. J Voice. 2023:S0892-1997(23)00390-9.
• 30. Wuyts FL, De Bodt MS, Molenberghs G, Remacle M, Heylen L, Millet B, et al. The dysphonia severity index: an objective measure of vocal quality based on a multiparameter approach. J Speech Lang Hear Res. 2000;43(3):796–809.
  • 31. Maryn Y, De Bodt M, Roy N. The Acoustic Voice Quality Index: toward improved treatment outcomes assessment in voice disorders. J Commun Disord. 2010;43(3):161–74. [DOI] [PubMed] [Google Scholar]
  • 32. Barsties v Latoszek B, Maryn Y, Gerrits E, De Bodt M. The Acoustic Breathiness Index (ABI): a multivariate acoustic model for breathiness. J Voice. 2017;31(4):511.e11–27. [DOI] [PubMed] [Google Scholar]
  • 33. Maryn Y, Corthals P, Van Cauwenberge P, Roy N, De Bodt M. Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels. J Voice. 2010;24(5):540–55. [DOI] [PubMed] [Google Scholar]
  • 34. Delgado Hernández J, León Gómez NM, Jiménez A, Izquierdo LM, Barsties V Latoszek B. Validation of the acoustic voice quality index version 03.01 and the acoustic breathiness index in the Spanish language. Ann Otol Rhinol Laryngol. 2018;127(5):317–26. [DOI] [PubMed] [Google Scholar]
  • 35. Hosokawa K, Barsties V Latoszek B, Iwahashi T, Iwahashi M, Iwaki S, Kato C, et al. The acoustic voice quality index version 03.01 for the Japanese-speaking population. J Voice. 2019;33(1):125.e1–12. [DOI] [PubMed] [Google Scholar]
  • 36. Maryn Y, De Bodt M, Barsties B, Roy N. The value of the Acoustic Voice Quality Index as a measure of dysphonia severity in subjects speaking different languages. Eur Arch Otorhinolaryngol. 2014;271(6):1609–19. [DOI] [PubMed] [Google Scholar]
  • 37. Reynolds V, Buckland A, Bailey J, Lipscombe J, Nathan E, Vijayasekaran S, et al. Objective assessment of pediatric voice disorders with the acoustic voice quality index. J Voice. 2012;26(5):672.e1–7. [DOI] [PubMed] [Google Scholar]
  • 38. Barsties V Latoszek B, Ulozaite-Staniene N, Petrauskas T, Uloza V, Maryn Y. Diagnostic accuracy of dysphonia classification of DSI and AVQI. Laryngoscope. 2019;129(3):692–8. [DOI] [PubMed] [Google Scholar]
  • 39. Idrisoglu A, Dallora AL, Anderberg P, Berglund JS. Applied machine learning techniques to diagnose voice-affecting conditions and disorders: systematic literature review. J Med Internet Res. 2023;25:e46105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Yousef AM, Deliyski DD, Zacharias SR, Naghibolhosseini M. Deep-learning-based representation of vocal fold dynamics in adductor spasmodic dysphonia during connected speech in high-speed videoendoscopy. J Voice. 2022:S0892-1997(22)00263-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Al-Hussain G, Shuweihdi F, Alali H, Househ M, Abd-Alrazaq A. The effectiveness of supervised machine learning in screening and diagnosing voice disorders: systematic review and meta-analysis. J Med Internet Res. 2022;24(10):e38472. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Hegde S, Shetty S, Rai S, Dodderi T. A survey on machine learning approaches for automatic detection of voice disorders. J Voice. 2019;33(6):947.e11–33. [DOI] [PubMed] [Google Scholar]
  • 43. Ur Rehman M, Shafique A, Azhar QUA, Jamal SS, Gheraibia Y, Usman AB. Voice disorder detection using machine learning algorithms: an application in speech and language pathology. Eng Appl Artif Intell. 2024;133:108047. [Google Scholar]
  • 44. Selvakumari NS, Radha V. A voice activity detector using SVM and Naïve Bayes classification algorithm. In: 2017 International Conference on Signal Processing and Communication (ICSPC). IEEE; 2017. p. 1–6. [Internet] [cited 2024 Nov 9]. Available from: https://ieeexplore.ieee.org/abstract/document/8305815/ [Google Scholar]
  • 45. Yousef AM, Deliyski DD, Zacharias SR, de Alarcon A, Orlikoff RF, Naghibolhosseini M. A hybrid machine-learning-based method for analytic representation of the vocal fold edges during connected speech. Appl Sci. 2021;11(3):1179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Yousef AM, Deliyski DD, Zacharias SRC, De Alarcon A, Orlikoff RF, Naghibolhosseini M. A deep learning approach for quantifying vocal fold dynamics during connected speech using laryngeal high-speed videoendoscopy. J Speech Lang Hear Res. 2022;65(6):2098–113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Yousef AM, Deliyski DD, Zacharias SR, de Alarcon A, Orlikoff RF, Naghibolhosseini M. Automated detection and segmentation of glottal area using deep-learning neural networks in high-speed videoendoscopy during connected speech. 14th International Conference Advances in Quantitative Laryngology, Voice and Speech Research (AQL). 2021. [Internet] [cited 2024 Jun 29]. p. 29–30. Available from: https://aql2021.sciencesconf.org/data/pages/210623_BOOK_OF_ABSTRACTS_full.pdf#page=35
  • 48. Ghasemzadeh H, Hillman RE, Mehta DD. Toward generalizable machine learning models in speech, language, and hearing sciences: estimating sample size and reducing overfitting. J Speech Lang Hear Res. 2024;67(3):753–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Castillo-Allendes A, Codino J, Cantor-Cutiva LC, Nudelman CJ, Rubin AD, Barsties V Latoszek B, et al. Clinical utility and validation of the acoustic voice quality and acoustic breathiness indexes for voice disorder assessment in English speakers. J Clin Med. 2023;12(24):7679. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Dietsch AM, Mocarski R, Hope DA, Woodruff N, McKelvey M. Revisiting the Rainbow: culturally responsive updates to a standard clinical resource. Am J Speech Lang Pathol. 2023;32(1):377–80. [DOI] [PubMed] [Google Scholar]
  • 51. Barsties B, Maryn Y. External validation of the acoustic voice quality index version 03.01 with extended representativity. Ann Otol Rhinol Laryngol. 2016;125(7):571–83. [DOI] [PubMed] [Google Scholar]
  • 52. Barsties B, Maryn Y. The improvement of internal consistency of the acoustic voice quality index. Am J Otolaryngol. 2015;36(5):647–56. [DOI] [PubMed] [Google Scholar]
  • 53. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30. [Google Scholar]
  • 54. Dollaghan CA. The handbook for evidence-based practice in communication disorders. Paul H. Brookes Publishing Co.; 2007. [Internet] [cited 2024 Nov 5]. Available from: https://psycnet.apa.org/record/2007-03655-000 [Google Scholar]
  • 55. Berardi M, Brosch K, Pfarr J, Schneider K, Sültmann A, Thomas-Odenthal F, et al. Relative importance of speech and voice features in the classification of schizophrenia and depression. Transl Psychiatry. 2023;13(1):298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Yousef AM, Deliyski DD, Zacharias SR, Naghibolhosseini M. Detection of vocal fold image obstructions in high-speed videoendoscopy during connected speech in adductor spasmodic dysphonia: a convolutional neural networks approach. J Voice. 2022. [Internet] [cited 2024 Jun 29]; Available from: https://www.sciencedirect.com/science/article/pii/S0892199722000273 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Yousef AM, Deliyski DD, Zayernouri M, Zacharias SR, Naghibolhosseini M. Deep learning-based analysis of glottal attack and offset times in adductor laryngeal dystonia. J Voice. 2023. [Internet] [cited 2024 Jun 29]; Available from: https://www.sciencedirect.com/science/article/pii/S0892199723003193 [DOI] [PMC free article] [PubMed] [Google Scholar]


Data Availability Statement

The data that support the findings of this study are not publicly available because they contain information that could compromise the privacy of research participants; they are, however, available from the corresponding author, A.M.Y., upon reasonable request.


