Journal of Speech, Language, and Hearing Research (JSLHR)
2025 Oct 29;68(12):5694–5707. doi: 10.1044/2025_JSLHR-25-00264

Segment-Based Signal Typing and Predictive Modeling in Pediatric Dysphonia With Different Vibratory Sources

Yeonggwang Park a, Supraja Anand b, Susan Baker Brehm c,d, Lisa Kelchner d,e, Barbara Weinrich c,d, Rahul Shrivastav f, Alessandro de Alarcon g, David A Eddins a
PMCID: PMC12704707  PMID: 41161677

Abstract

Purpose:

Severe dysphonia in children often poses a challenge for conventional acoustic measurement methods due to the high degree of aperiodicity, which can result in invalid or unreliable measures. Signal typing can support the validity of these measures, but current methods rely on subjective inspection and do not account for multiple signal types within a voice sample. This study aimed to improve current signal typing practices by refining a manual signal typing tool for segment-level labeling and developing a predictive model for objective signal typing.

Method:

Sustained /ɑ/ phonations from 94 children with a glottal vibratory source and 30 children with a supraglottal vibratory source (SGVS) were evaluated by three expert speech-language pathologists using the signal typing tool. Signal type labels determined through expert consensus were considered the ground truth for each segment and used to train a predictive model. Computational measures associated with periodicity and voice quality, including pitch strength, envelope standard deviation (EnvSD8), sharpness, and smoothed cepstral peak prominence (CPPS), were extracted and used in an ordinal logistic regression model. Model performance was evaluated using a held-out test set and fivefold cross-validation.

Results:

Manual signal typing revealed that 11% of the overall samples and 20% of the samples with SGVS included two or more signal types. A predictive model incorporating EnvSD8, CPPS, and sharpness achieved good to excellent prediction accuracy (81%–96%) across signal types in both the test and cross-validation sets.

Conclusions:

The manual signal typing tool developed in this study shows promise for improving the precision of signal typing, which may enhance the reliability of conventional acoustic measures and enable the calculation of signal type proportions as potential outcome metrics. Automating signal typing using the measures investigated in this study could further increase the clinical utility of this tool by providing objective signal typing.


Pediatric dysphonia is a relatively common condition, with prevalence estimates ranging from 6% to 19% in previous studies (Carding et al., 2006; Fujiki & Thibeault, 2024; Johnson et al., 2020). It can present with varying degrees of severity, ranging from mild hoarseness to severe vocal impairment, often affecting a child's ability to communicate effectively and impacting their overall quality of life (Merati et al., 2008). The most frequent causes of pediatric dysphonia are vocal fold nodules and cysts, which are benign lesions on the vocal folds (Angelillo et al., 2008; Martins et al., 2012). These nodules and cysts are typically associated with vocal hyperfunction, a pattern of excessive or inefficient use of the laryngeal and paralaryngeal muscles (Hillman et al., 1989) that contributes to dysphonia. Children with such vocal pathologies rely primarily on their true vocal folds and use a glottal vibratory source (GVS).

In more severe cases, pediatric dysphonia arises from acquired or congenital laryngeal conditions that impair normal voice production. Premature infants often undergo prolonged or repeated tracheal intubation, increasing the risk of laryngeal trauma, including subglottic stenosis, which may require laryngotracheal reconstruction to maintain a stable airway. Additionally, some children present with congenital laryngeal abnormalities, such as vocal fold paralysis (unilateral or bilateral) or glottic web formations, which hinder proper vocal fold vibration and closure. These conditions can severely compromise voice quality by limiting the ability of the true vocal folds to function as the primary vibratory source. Instead, these children often engage alternative supraglottal vibratory sources (SGVS) such as ventricular folds as well as supraglottic and epiglottic structures to produce voice (Clary et al., 1996; Kelchner, Weinrich, et al., 2010; Smith et al., 1993; Weinrich et al., 2007). Children may also use a combination of GVS and SGVS together (Kelchner, Weinrich, et al., 2010; Krival et al., 2007), commonly referred to as mixed vibratory source (MVS). As a result, previous studies have classified pediatric dysphonia based on the primary vibratory source used for voice production (Krival et al., 2007). Due to limited glottal function, children primarily using SGVS tend to have severe aperiodicity in voice signals and dysphonia. Compared to those with GVS, children with SGVS exhibit greater roughness and strain in their voice quality (Krival et al., 2007; Park et al., 2023). Additionally, it is reported that children and adolescents with SGVS and MVS predominantly exhibit Type 2 and Type 3 signals in signal typing classification, indicating irregular and turbulent phonatory patterns (Kelchner, Weinrich, et al., 2010; Zacharias et al., 2016).

Signal typing classifies dysphonic voices into Type 1, 2, and 3 signals based on their periodicity (Titze, 1995). Type 1 represents nearly periodic signals, Type 2 covers signals with bifurcations, subharmonics, and/or modulating frequencies, and Type 3 covers signals without obvious periodicity. Signal typing is necessary for validating perturbation measures such as jitter, shimmer, and harmonics-to-noise ratio (HNR), as they rely on accurate fundamental frequency (F0) estimation. Typically, these measures are recommended only for Type 1 signals (Titze, 1995). Because SGVS voices predominantly exhibit Type 2 and Type 3 phonation due to a lack of periodic vibration, this restriction presents a significant limitation for conventional acoustic measures in this population. Recent recommendations from the American Speech-Language-Hearing Association (ASHA) do not include perturbation measures in acoustic voice evaluations (Patel et al., 2018), but these measures are still frequently used and recommended in research and clinical settings, as seen in recent guidelines of the European Laryngological Society and the Union of the European Phoniatricians (Lechien et al., 2023). F0 is another crucial voice measure and is recommended by ASHA, but its accuracy can also be reduced in more aperiodic, higher type signals (Anand et al., 2021; Mehta & Hillman, 2008). Therefore, signal typing remains necessary to evaluate the validity of perturbation and F0 measures.

The most commonly used method of signal typing involves listening to voice samples and visualizing them on narrowband spectrograms (Titze, 1995). This approach is subjective and time-consuming and requires specialized training, limiting its practical application in clinical settings. Only a few previous studies have examined the accuracy of manual signal typing and its rater agreement. Ma and Yiu (2005) reported 75% intrarater and 77% interrater exact agreement in the signal typing task. However, Behrman et al. (1998) reported that around 40% of signals assigned as Type 3 were not clearly distinguishable from Type 2. Another issue with the current signal typing method is that it does not specify how to handle different segments within voice samples that may exhibit different signal types. Although Titze's (1995) signal typing scheme suggests classifying signals with bifurcation or subharmonic components as Type 2 signals, it does not specify how much of the signal must exhibit these features for the sample to be classified as such. This consideration can be important because some samples labeled as Type 2 may contain only minor bifurcation, and perturbation and F0 measures may still be applicable for the remaining periodic segments.

To address this challenge, our goal was to refine a manual signal typing tool (Anand et al., 2021) that can label distinct segments within a voice sample. Labeling distinct segments according to signal type can provide clarity to raters when different signal types are present in varying proportions within a voice sample. Instead of deciding whether the entire sample should be classified as Type 1 or Type 2, raters can label each segment with the appropriate type. This labeling will also enable the calculation of the proportion of different types of segments, offering more context for assessing the validity of the traditional measures. Additionally, recent studies have observed a significant increase in the proportion of Type 1 signals in voice recordings after treatment, indicating improved voice quality (Kopf et al., 2017; Liu et al., 2023). These results suggest that the proportion of Type 1 signals within a recorded sample may serve as a useful outcome measure. Similarly, changes in the proportions of Types 2 and 3, particularly in children with SGVS voices, may reflect voice quality changes and have the potential to be used as outcome measures. Therefore, the proportion of each signal type within a voice sample, calculated through segment-level labeling, could serve as an additional metric that may be especially valuable for SGVS populations, where conventional acoustic measures are often limited or error prone.
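The proportion metric described above is straightforward to compute once segment-level labels exist. The sketch below is a minimal illustration; the tuple format and helper name are assumptions, and Python is used here even though the study's tooling is MATLAB based.

```python
# Hypothetical helper: given (start_s, end_s, type) labels for one
# voice sample, return the fraction of labeled time per signal type.

def signal_type_proportions(segments):
    total = sum(end - start for start, end, _ in segments)
    props = {}
    for start, end, label in segments:
        props[label] = props.get(label, 0.0) + (end - start) / total
    return props

# A 1-s sample labeled 0.7 s of Type 1 and 0.3 s of Type 2:
props = signal_type_proportions([(0.0, 0.7, 1), (0.7, 1.0, 2)])
assert abs(props[1] - 0.7) < 1e-9 and abs(props[2] - 0.3) < 1e-9
```

A post-treatment rise in the Type 1 proportion, as observed by Kopf et al. (2017), would then be directly readable from such output.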

In addition to the subjective method of signal typing described by Titze (1995), a limited number of studies have explored an objective method of signal typing using nonlinear dynamic analyses (Liu et al., 2018, 2019). Liu et al. (2019) introduced a diffusive chaos method to quantify the presence of periodicity, deterministic chaos, and stochastic noise in sustained /ɑ/ recordings and referred to an estimated output as the voice type component (VTC) profile. This profile represents the aggregate proportion of different periodicity components (e.g., periodic, chaotic, and stochastic) within a given voice sample, analogous to signal types. Their results demonstrated that voice samples with different signal types exhibited significantly different VTC distributions. VTC1, which reflects the proportion of Type 1–like periodic components, differed significantly across all types except between Type 2 and Type 3 signals (Liu et al., 2019).

Miramont et al. (2022) used a linear support vector machine trained on a set of features selected for their relevance to signal type classification. These features included shimmer, HNR, smoothed cepstral peak prominence (CPPS), and nonlinear dynamic measures derived from sustained /ɑ/ recordings. Their final model achieved an overall 82.7% accuracy in cross-validation. However, a key limitation of this study is the inclusion of perturbation measures (shimmer and HNR), which are known to be unreliable for higher signal types. This may reduce classification reliability, especially for more aperiodic signals. Furthermore, both the Liu et al. (2019) and Miramont et al. (2022) approaches provide an estimate of the dominant signal type within a sustained /ɑ/ sample, but neither specifies which portions of the signal correspond to specific signal types over time. Developing a time-segmented signal type classification approach, particularly one that incorporates measures robust to F0 irregularity, could improve the accuracy and clinical applicability of F0-based measures as well as the development of signal type proportion metrics discussed earlier.

As a step toward developing time-segmented automated signal type estimation, this study used segment-based signal typing labels obtained through manual signal typing by speech-language pathologists (SLPs) to develop a predictive model for signal typing. Rather than using the diffusive chaos method and perturbation measures used in previous studies, this study explores alternative measures that may offer greater interpretability and better applicability for time-segmented signal typing. Previous research has demonstrated that computational estimates of pitch strength, which reflect the salience or strength of the perceived pitch, showed significant differences between signal types in adult voices (Anand et al., 2021). This is consistent with the understanding that pitch strength is directly related to the periodicity of the voice signal and suggests that pitch strength estimate could be a useful measure for automated signal typing. Cepstral peak prominence (CPP) is a widely used measure that represents the normalized cepstral peak amplitude. CPP indicates the strength of the cepstral peak and reflects the degree of signal periodicity (Heman-Ackah et al., 2003; Hillenbrand et al., 1994). Thus, we hypothesized that both pitch strength and CPP would decrease as signal type increased.

The pitch strength estimate is one of several measures derived from bio-inspired computational models of auditory processing, which incorporate model stages that reflect how the human auditory system processes acoustic signals. Processing voice signals using these models has demonstrated strong predictive capabilities for the perception of various voice qualities, including breathiness, roughness, and strain (Anand et al., 2019; Eddins et al., 2016; Park et al., 2022; Shrivastav et al., 2011). These computational measures for voice quality are grounded in a comprehensive psychoacoustic framework of general sound quality, in which tonality linked to periodicity is associated with vocal breathiness, roughness related to amplitude fluctuations with perceived roughness, and sharpness determined by the shape of the spectral envelope with vocal strain (Anand et al., 2019; Eddins et al., 2015, 2016; Park et al., 2023).

Given that manual signal typing relies on the auditory perception of the rater along with visual inspection of narrowband spectrograms, these computational measures related to voice quality, including the pitch strength estimate, could also contribute to the objective assessment of signal typing. In addition to the pitch strength correlate of breathiness, the perception of vocal roughness has been correlated with the output of a bio-inspired model of temporal envelope perception. Specifically, the envelope standard deviation (EnvSD8) measure, which represents the standard deviation of the amplitude envelope processed using a temporal envelope modulation filter bank model that simulates the auditory system's processing of amplitude fluctuations, has been associated with perceived roughness (Park et al., 2022, 2023). Considering that increased roughness and EnvSD8 values have been observed in SGVS voices (Krival et al., 2007; Park et al., 2023), which predominantly exhibit Type 2 and Type 3 signals (Baker et al., 2006; Kelchner, Brehm, et al., 2010), we hypothesized that this measure would be elevated in higher signal types and contribute to signal typing prediction. Similarly, a bio-inspired model of spectral sharpness has been correlated strongly with perceived vocal strain (Park et al., 2023) and could also be used in the prediction of signal types, as it may reflect the increased mid–high frequency spectral energy often observed in dysphonic voices and higher type signals. Accordingly, we hypothesized that sharpness values would be greater in higher signal types and would also contribute to accurate signal typing prediction.
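As an informal illustration of the envelope-fluctuation idea behind EnvSD8, the sketch below computes the standard deviation of a Hilbert amplitude envelope restricted to a single modulation band. This is a schematic stand-in, not the Dau et al. (1997) filter bank model used in the study; the band edges, filter order, and test signals are arbitrary assumptions.

```python
# Schematic envelope-fluctuation measure: std. dev. of the normalized
# amplitude envelope within one modulation band. NOT the study's
# EnvSD8 implementation; a single illustrative bandpass stands in for
# the modulation filter bank.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def envelope_band_sd(x, fs, lo=50.0, hi=200.0):
    env = np.abs(hilbert(x))              # Hilbert amplitude envelope
    env = env / np.mean(env)              # normalize overall level
    sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
    mod = sosfiltfilt(sos, env)           # keep one modulation band
    return float(np.std(mod))

fs = 16000
t = np.arange(fs) / fs
steady = np.sin(2 * np.pi * 220 * t)                     # flat envelope
rough = (1 + 0.5 * np.sin(2 * np.pi * 80 * t)) * steady  # 80-Hz AM

# The amplitude-modulated ("rough") tone fluctuates far more:
assert envelope_band_sd(rough, fs) > 3 * envelope_band_sd(steady, fs)
```

The design choice mirrors the text: roughness-related measures quantify amplitude fluctuation in modulation bands rather than relying on F0 tracking.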

This study aimed to advance the process of signal typing in voice analysis. To achieve this, we refined a user-friendly tool to facilitate the labeling of segments with different signal types and developed predictive models of signal typing in pediatric voices with different vibratory sources, using objective measures related to periodicity and voice quality that are not dependent on accurate F0 estimation. This segment-level labeling approach was designed to improve clarity and precision compared to traditional whole-sample signal typing, particularly for voice samples containing multiple signal types, and to enhance precision for predictive model development.

Method

Voice Samples

We obtained pediatric voice samples from the Cincinnati Children's Hospital Medical Center (CCHMC) pediatric voice database for children in the age range of 4–11 years. From this database, a total of 124 samples were selected, including 94 GVS (34 females, 60 males; Mage = 7.3 years) and 30 SGVS (13 females, 17 males; Mage = 7.9 years) voices. These samples were selected based on the availability of verified vibratory source information and at least 1 s of steady sustained /ɑ/ phonation. The uneven distribution of samples among the vibratory groups resulted from the limited availability of SGVS samples. The GVS group primarily comprised diagnoses related to vocal fold abnormalities: vocal fold lesions including nodules and cysts (n = 74), hemorrhage (n = 1), granuloma (n = 1), papilloma (n = 1), and microwebs (n = 1), as well as functional dysphonia (n = 7), laryngospasm (n = 1), and vocal cord dysfunction associated with breathing difficulties, including conditions such as exercise-induced laryngeal obstruction and paradoxical vocal fold motion (n = 5). Additionally, the GVS group included one child with subglottic stenosis and two children with unilateral and bilateral vocal fold paralysis, as these individuals still used the vocal folds as their primary vibratory source. The SGVS group predominantly presented with dysphonia associated with subglottic stenosis and/or prior laryngotracheal reconstruction surgery (n = 24), unilateral or bilateral vocal fold paralysis (n = 5), and vocal cord dysfunction presenting with dyspnea on exertion (n = 1). The vibratory source was verified through a meticulous protocol involving voice recording assessments, laryngeal videostroboscopy, and narrowband spectrogram analysis by a panel of three SLPs and two laryngologists. The panel viewed the visual images for identification of the primary vibrating structures and utilized the voice recording and spectrogram as supports and confirmation.
The use and sharing of pediatric voice recordings from CCHMC was conducted in compliance with the institutional review boards of the University of South Florida (IRB Pro0012381) and the University of Central Florida (Study 5671). All samples were cropped to retain the most stable 1-s portions and scaled to have equal root-mean-square amplitude.
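The cropping and amplitude-scaling step might be sketched as follows. The target RMS value and middle-of-sample cropping are simplifying assumptions; the study retained the most stable 1-s portion, presumably identified by inspection rather than by position.

```python
# Hedged sketch of the preprocessing: keep a 1-s portion and scale it
# to a common root-mean-square amplitude. The middle crop and the
# target_rms value are illustrative choices, not the study's method.
import numpy as np

def crop_and_rms_scale(x, fs, target_rms=0.05, dur_s=1.0):
    n = int(dur_s * fs)
    start = max(0, (len(x) - n) // 2)     # middle 1 s (simplification)
    seg = np.asarray(x[start:start + n], dtype=float)
    rms = np.sqrt(np.mean(seg ** 2))
    return seg * (target_rms / rms)

fs = 16000
x = np.sin(2 * np.pi * 200 * np.arange(3 * fs) / fs)  # 3-s tone
y = crop_and_rms_scale(x, fs)
assert len(y) == fs                                    # exactly 1 s
assert abs(np.sqrt(np.mean(y ** 2)) - 0.05) < 1e-6     # equal RMS
```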

Manual Signal Typing

The signal typing software used in this study was developed in MATLAB (R2022a, MathWorks) and refined from the previous version (Anand et al., 2021), which included segment-level labeling features. In this version, we added a playback functionality that allows users to listen to both the entire waveform and selected segments, supporting more precise segment labeling. Additionally, the visualization of the spectrogram was enhanced using a narrowband window size (10 ms) and a black-and-white color scheme, similar to those generated in Praat (Boersma & Weenink, 2021). Default spectrogram settings included a Hamming window type, frame rate of 10,000 frames per second, 1024-point Fast Fourier Transform (FFT), and zero-mean adjustment. Additionally, the tool included a zoom function that allowed users to magnify specific waveform regions along with their corresponding spectrogram. Figure 1 presents the layout of the tool, displaying both the acoustic waveform and the spectrogram. Different portions of the acoustic signals could be labeled, with colors representing different signal types.
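The stated spectrogram defaults can be approximated outside MATLAB. In the sketch below, the sampling rate, synthetic tone, and use of scipy are assumptions for illustration; only the window type, window length, FFT size, zero-mean step, and frame rate come from the text.

```python
# Approximation of the stated defaults: Hamming window, 10-ms window
# length, 1024-point FFT, zero-mean signal, and a hop giving roughly
# 10,000 frames per second. The 44.1-kHz rate and the 220-Hz tone are
# illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram

fs = 44100
t = np.arange(int(0.3 * fs)) / fs
x = np.sin(2 * np.pi * 220 * t)
x = x - np.mean(x)                       # zero-mean adjustment

nwin = int(0.010 * fs)                   # 10-ms Hamming window
hop = fs // 10000                        # ~10,000 frames per second
f, frames, S = spectrogram(x, fs=fs, window="hamming", nperseg=nwin,
                           noverlap=nwin - hop, nfft=1024)
assert len(f) == 513                     # 1024-point FFT -> 513 bins
```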

Figure 1.


MATLAB graphical user interface used in manual signal typing. The upper panel displays the acoustic waveform with relative amplitude (y-axis) as a function of time in seconds (x-axis). The lower panel displays the spectrogram with frequency in kiloHertz on the y-axis as a function of time in seconds (x-axis) and grayscale indicating amplitude.

Three experts (authors S.B.B., L.K., and B.W.), each with over 25 years of experience in evaluating and treating pediatric dysphonia, conducted manual signal typing. They underwent refresher training on signal typing, which included reviewing definitions of each signal type (Types 1, 2, and 3) from Titze (1995) and familiarizing themselves with the signal typing task through practice samples and feedback using the software. Although Sprecher et al. (2010) expanded on Titze's (1995) signal typing classification by defining Type 4 as Type 3 signals that are “stochastic in nature,” resembling broadband white noise, Type 4 was not included in this study. Manually distinguishing Type 3 from Type 4 could add difficulty to an already challenging task, and the potential utility of this distinction was deemed less critical to the study's aim. The experts received detailed instructions on tool usage and performed signal typing manually, selecting and labeling either an entire 1-s sample or specific portions as Type 1, 2, or 3 using the application. The default spectrogram settings were deemed appropriate for the task, and the experts did not modify these settings during the labeling process. The experts conducted this task on the 124 samples.

To evaluate the intrarater agreement, 20% of the samples (n = 25) were repeated on a different day following the initial completion of the entire set by the three experts. After the three experts completed the reassessment, they convened for a consensus session. In instances where the experts did not achieve exact agreement on a sample, they engaged in discussions and collaboratively determined the final signal type and portions with different signal types for those samples (n = 55, 34 GVS and 21 SGVS). They also specified the precise duration of each labeled portion in samples containing two or more different signal types. This step aimed to ensure the accuracy of signal typing data crucial for the development of automated algorithms. The signal type assigned through expert consensus manual typing was regarded as the reference standard for each labeled portion. Based on this consensus labeling, voice samples that exhibited two or more signal types were segmented according to the labeled portions for use in training and testing automated signal typing algorithms.

Predictor Variables

To develop an automated signal typing algorithm, we used bio-inspired computational measures—the pitch strength estimate, EnvSD8 from the amplitude modulation filter bank output, and sharpness—as independent predictor variables for our model. As detailed in the introduction, these measures use auditory models to process acoustic signals, reflecting auditory processing and correlating with perceived breathiness, roughness, and strain. The bio-inspired computational measures were derived from each labeled portion using custom MATLAB scripts. Pitch strength was obtained using Aud-SWIPE' (Camacho, 2012), EnvSD8 was obtained using the Dau model (Dau et al., 1997) detailed in Park et al. (2022) and refined in Park et al. (2025), and sharpness was obtained using the Zwicker model (Fastl & Zwicker, 2007). Additionally, CPPS, which incorporates smoothing of the cepstral signal before estimating CPP, was obtained using a Praat plug-in developed by Heller Murray et al. (2025). None of our predictor variables requires accurate F0 estimation, a crucial advantage, particularly for SGVS samples with predominant Type 2 and Type 3 signals.

Statistical Analysis

In this initial phase of automating signal typing, ordinal logistic regression was employed to evaluate the predictive capability of the proposed predictor variables in determining signal types and their respective predictive strengths. Ordinal logistic regression is a statistical technique for predicting an ordinal dependent variable from one or more independent predictor variables. One of its assumptions is the absence of multicollinearity, that is, no substantial correlation among two or more predictor variables. To examine this assumption, we conducted a collinearity assessment, obtaining variance inflation factors (VIFs) to quantify the extent of collinearity among predictors. Our analysis revealed that pitch strength and CPPS exhibited VIF values indicative of significant collinearity (Johnston et al., 2018). Consequently, we developed two distinct ordinal regression models: one incorporating pitch strength, EnvSD8, and sharpness and the other incorporating CPPS, EnvSD8, and sharpness.

Results

Intra- and Interrater Agreement

Our assessment of the intrarater agreement involved a reevaluation of 20% of the samples, which were randomly selected for repetition. These repeated ratings were conducted prior to the consensus session, ensuring that the meeting did not influence the agreement data. To calculate exact agreement, we considered matching portions in both ratings, even in cases where portions were labeled for durations of less than 1 s. The experts demonstrated an exact percent agreement ranging from 71% to 90.1% (see Table 1). Within-rater discrepancies involved differences of only one signal type. Table 1 also presents the interrater agreement outcomes for each vibratory source category and for the collective sample set. Specifically, the exact percent agreement among the three experts for GVS was 73.3%, and for SGVS, it was 54.4%. When disagreements occurred, they predominantly consisted of differences of one type, with only one labeled portion demonstrating differences of two types among the three experts.
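Exact and one-type-difference percent agreement of the kind reported in Table 1 can be computed in a few lines. The sketch below uses invented ratings and a pairwise comparison; the study's procedure matched labeled portions across three raters, which this simplifies.

```python
# Pairwise sketch of exact / one-type / two-type percent agreement.
# Ratings are made up for illustration.

def percent_agreement(r1, r2):
    diffs = [abs(a - b) for a, b in zip(r1, r2)]
    n = len(diffs)
    return {"exact": 100 * diffs.count(0) / n,
            "one_type": 100 * diffs.count(1) / n,
            "two_type": 100 * diffs.count(2) / n}

first_pass = [1, 1, 2, 3, 2, 1, 1, 3, 2, 1]
second_pass = [1, 2, 2, 3, 2, 1, 1, 2, 2, 1]
print(percent_agreement(first_pass, second_pass))
# → {'exact': 80.0, 'one_type': 20.0, 'two_type': 0.0}
```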

Table 1.

Intra- and interrater percent agreement of signal typing.

Variable | Exact | 1-type difference | 2-type difference
Intrarater percent agreement
 Rater A | 71.0% | 29.0% | 0%
 Rater B | 90.1% | 5.9% | 0%
 Rater C | 84.0% | 16.0% | 0%
Interrater percent agreement
 GVS | 73.3% | 26.5% | 0.2%
 SGVS | 54.4% | 45.6% | 0%
 All | 68.7% | 31.1% | 0.2%

Note. GVS = glottal vibratory source; SGVS = supraglottal vibratory sources.

Signal Type–Labeled Portions

Based on the manual signal typing consensus, we collected a total of 141 signal type–labeled portions, encompassing both the entire 1-s samples (n = 110) and segments shorter than 1 s (n = 31), all from the 124 voice samples (94 GVS and 30 SGVS) analyzed in our study. Specifically, eight GVS samples were segmented into 19 segments, while six SGVS samples were segmented into 12 segments. These segments varied in duration from 0.13 to 0.8 s. A data set comprising 141 signal type–labeled portions, inclusive of 1-s samples and shorter segments, guided the automated signal typing development.

Table 2 illustrates the distribution of labeled portions by signal type and vibratory source. The labeled portions from the GVS group were categorized predominantly as Type 1, whereas a majority from the SGVS category fell into Type 2 or 3, consistent with previous research (Kelchner, Weinrich, et al., 2010; Zacharias et al., 2016). Overall, our data set comprised 57% Type 1 (n = 81), 28% Type 2 (n = 39), and 15% Type 3 (n = 21) labeled portions. The table also lists the number of labeled segments (less than 1 s) for each signal type and vibratory source, shown in parentheses. Notably, Types 2 and 3 exhibited a higher proportion of shorter segments compared to Type 1. This is because most Type 1 samples were of the entire length, whereas many Type 2 and Type 3 signals occurred more intermittently during sustained phonation in our disordered voice samples.

Table 2.

Signal type distribution of the labeled portions.

Vibratory source | Type 1 | Type 2 | Type 3
GVS | 78 (10) | 25 (9) | 2 (0)
SGVS | 3 (0) | 14 (6) | 19 (6)
Total | 81 (10) | 39 (15) | 21 (6)

Note. Parentheses indicate the number of labeled segments between 0.13 and 0.8 s. GVS = glottal vibratory source; SGVS = supraglottal vibratory sources.

Ordinal Logistic Regression

To construct an ordinal regression predictive model for signal typing, the 141 labeled portions were randomly divided into training and test sets, using a 70/30 ratio. A stratified split was used to ensure that each set reflected the proportion of each signal type (see Table 3). An assumption of ordinal regression is the absence of multicollinearity among predictor variables. To assess multicollinearity, VIFs were calculated for each predictor in the training set. Pitch strength (VIF = 7.14) and CPPS (VIF = 5.07) exhibited VIF values exceeding 2.5, often considered to indicate substantial collinearity (Johnston et al., 2018), while EnvSD8 (VIF = 1.76) and sharpness (VIF = 1.26) did not exceed this threshold. To address the collinearity between pitch strength and CPPS, two separate ordinal regression models were constructed using the training set: one including pitch strength, EnvSD8, and sharpness as predictors and the other including CPPS, EnvSD8, and sharpness.
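The VIF screening can be illustrated with synthetic data and no external statistics package. The predictors below are random stand-ins; 'cpps' is deliberately constructed to be collinear with 'pitch_strength' so that the check flags the same pair as in the study.

```python
# Sketch of the collinearity check: VIF of column j is 1 / (1 - R^2)
# from regressing that column on the remaining predictors. Synthetic
# data only; the study computed VIFs on its training set.
import numpy as np

def vif(X, j):
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 100
pitch_strength = rng.normal(size=n)
cpps = 0.9 * pitch_strength + 0.1 * rng.normal(size=n)  # collinear pair
envsd8, sharp = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([pitch_strength, cpps, envsd8, sharp])

assert vif(X, 0) > 2.5 and vif(X, 1) > 2.5   # PS and CPPS: collinear
assert vif(X, 2) < 2.5 and vif(X, 3) < 2.5   # EnvSD8 and sharpness: not
```

Exceeding the 2.5 threshold for one member of a correlated pair is what motivates fitting two separate models rather than one.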

Table 3.

Distribution of signal types in training and test sets (70/30 split).

Set | Type 1 | Type 2 | Type 3
Training | 57 (7) | 27 (11) | 15 (5)
Test | 24 (3) | 12 (4) | 6 (1)

Note. Parentheses indicate the number of labeled segments between 0.13 and 0.8 s.

In the first model, which included pitch strength, sharpness (p = .4) was not a significant predictor and was therefore excluded. The final model with pitch strength and EnvSD8 accounted for a significant amount of variance (χ2 = 146.91, p < .001, Nagelkerke's R2 = .91). A second model included CPPS, EnvSD8, and sharpness as predictors. This model also accounted for a significant amount of variance (χ2 = 150.59, p < .001; Nagelkerke's R2 = .92). Table 4 presents the parameter estimates for predictor variables in each model. As hypothesized, higher pitch strength (PS) and CPPS values were associated with lower signal types. In contrast, lower EnvSD8 and sharpness values were associated with lower signal types.

Table 4.

Parameter estimates for two ordinal logistic regression models (training set).

Predictor | β | Wald | p
PS & EnvSD8
 PS | −47.72 | 17.70 | < .001
 EnvSD8 | 85.25 | 11.08 | < .001
EnvSD8, CPPS, & Sharp
 EnvSD8 | 162.69 | 13.507 | < .001
 CPPS | −2.18 | 12.676 | .001
 Sharp | 13.37 | 8.236 | .004

Note. PS = pitch strength; EnvSD8 = envelope standard deviation at modulation filter 8; CPPS = smoothed cepstral peak prominence; Sharp = sharpness.

Two ordinal logistic regression models were constructed to predict the cumulative probability of the signal type being at or below category j, as follows:

PS & EnvSD8 Model:

logit P(Type ≤ j) = α_j + 47.72 · PS − 85.25 · EnvSD8,  j = 1, 2

where the estimated thresholds for Model 1 are α_1 = −8.68 (Type ≤ 1) and α_2 = 0.77 (Type ≤ 2).

EnvSD8, CPPS, & Sharpness Model:

logit P(Type ≤ j) = α_j − 162.69 · EnvSD8 + 2.18 · CPPS − 13.37 · Sharp,  j = 1, 2

where the estimated thresholds for Model 2 are α_1 = −0.29 (Type ≤ 1) and α_2 = 14.02 (Type ≤ 2).
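Assuming the standard proportional-odds form, the reported Model 2 coefficients and thresholds can be applied directly to classify a labeled portion. The input values below are invented for illustration and are not drawn from the study's data set.

```python
# Applying the reported EnvSD8, CPPS, & Sharpness model: convert the
# two cumulative logits into per-type probabilities and pick the most
# probable type. Coefficients/thresholds are from the text; the inputs
# in the examples are hypothetical.
import math

ALPHA = (-0.29, 14.02)          # thresholds: Type <= 1, Type <= 2

def predict_type(envsd8, cpps, sharp):
    lin = -162.69 * envsd8 + 2.18 * cpps - 13.37 * sharp
    cum = [1 / (1 + math.exp(-(a + lin))) for a in ALPHA]  # P(Type <= j)
    probs = [cum[0], cum[1] - cum[0], 1 - cum[1]]          # P(Type = j)
    return 1 + probs.index(max(probs))

print(predict_type(envsd8=0.01, cpps=15.0, sharp=1.0))  # → 1 (periodic)
print(predict_type(envsd8=0.20, cpps=2.0, sharp=2.0))   # → 3 (aperiodic)
```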

Prediction Accuracy

Figure 2 displays the prediction accuracy of both models for the training set (left side) and the test set (right side). In the training set, both models demonstrated high overall performance, achieving 93% accuracy in correctly predicting signal types. Accuracy across different signal types was also similar between the two models, with the highest accuracy observed for Type 1 (96%), followed by Type 3 (93%) and Type 2 (85%). Although both models showed the same accuracy in the training set, further examination revealed that four of the seven incorrectly typed samples differed between the two models.

Figure 2.

Accuracy by signal type and model. Training set (n = 99 labeled portions), both models: all types 93%, Type 1 96%, Type 2 85%, Type 3 93%. Test set (n = 42): PS & EnvSD8 model: all types 76%, Type 1 83%, Type 2 58%, Type 3 83%; EnvSD8, CPPS, & Sharp model: all types 88%, Type 1 92%, Type 2 83%, Type 3 83%.

Prediction accuracy of the PS & EnvSD8 model and the EnvSD8, CPPS, & Sharp model for the training set (left) and the test set (right) by signal type. PS = pitch strength; EnvSD8 = envelope standard deviation at modulation filter 8; CPPS = smoothed cepstral peak prominence; Sharp = sharpness.

The overall accuracy in the test set was lower than that obtained from the training set for both models. However, the model with three predictors (EnvSD8, CPPS, and Sharpness) had a higher overall accuracy (88%) that was closer to its training accuracy than the model with only pitch strength and EnvSD8 (76%). The three-predictor model also generally outperformed the two-predictor model in accuracy by signal type, with the most pronounced difference observed for Type 2 signals (85% vs. 58%). Further examination revealed that all incorrect type predictions involved adjacent signal types: Type 1 samples predicted as Type 2, Type 2 samples as Type 1, and Type 3 samples as Type 2.

Following model testing on the test set, the EnvSD8, CPPS, and Sharpness model, which demonstrated higher accuracy, was selected for further analysis. To evaluate potential differences in accuracy between 1-s samples and segmented samples, accuracy was examined by signal type for each group in both the training and test sets (see Table 5). Segmented samples demonstrated slightly lower accuracy than 1-s samples. Additionally, no substantial difference in accuracy was found between the GVS (90%) and SGVS (83%) samples.

Table 5.

Prediction accuracy by signal type for 1-s versus segmented samples in training and test sets.

Variable Type 1 Type 2 Type 3
Training set
 1 s 100% (n = 50) 88% (n = 16) 100% (n = 10)
 Segments 71% (n = 7) 82% (n = 11) 80% (n = 5)
Test set
 1 s 95% (n = 21) 88% (n = 8) 100% (n = 5)
 Segments 67% (n = 3) 75% (n = 4) 0% (n = 1)

Note. Labeled segments are between 0.13 and 0.8 s.

Fivefold Cross-Validation

To assess potential overfitting, fivefold cross-validation was conducted on the training set. Each fold was stratified to maintain the same ratio of signal types (see Table 6). Figure 3 shows the mean accuracy obtained from the fivefold cross-validation for the model. The cross-validation results demonstrated accuracy comparable to that observed in the training and test sets, supporting the conclusion that the model was not overfit.
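The stratification step can be sketched in plain Python. Round-robin dealing within each signal type is one simple way (our assumption, not the study's stated procedure) to preserve the type ratios across folds:

```python
from collections import defaultdict
import random

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds while preserving the
    proportion of each signal type in every fold."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for i, lab in enumerate(labels):
        by_type[lab].append(i)
    folds = [[] for _ in range(k)]
    for lab, idxs in by_type.items():
        rng.shuffle(idxs)
        # Deal each type's samples round-robin across the folds so
        # fold counts per type differ by at most one.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds
```

With the training-set composition in Table 6 (57 Type 1, 27 Type 2, 15 Type 3), each fold receives 11–12 Type 1, 5–6 Type 2, and 3 Type 3 samples.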

Table 6.

Distribution of signal types across cross-validation folds in training set.

Fold Type 1 Type 2 Type 3
1 12 (2) 5 (1) 3 (1)
2 12 (1) 5 (4) 3 (0)
3 11 (1) 5 (2) 3 (2)
4 11 (1) 6 (3) 3 (1)
5 11 (2) 6 (1) 3 (1)

Note. Parentheses indicate the number of labeled segments between 0.13 and 0.8 s.

Figure 3.


Mean prediction accuracy by signal type across fivefold cross-validation. Error bars represent 95% confidence intervals.

Discussion

This study examined the utility of a signal typing tool designed to facilitate the labeling of voice segments with different signal types. The goal was to advance current signal typing practices for voice recordings containing multiple signal types and to allow the calculation of signal type proportions as potential measures for pediatric voices, particularly for children with SGVS and MVS. The signal type–labeled segments also served as the basis for developing a predictive model of signal typing, which achieved high accuracy.

Manual Signal Typing and Segment Labeling

The manual signal typing tool demonstrated intrarater exact percent agreement averaging 81.7% across the three experts, with no rater showing a difference of more than one signal type when relabeling 20% of the samples. Overall interrater exact percent agreement reached 68.7%, which was lower than the intrarater agreement observed in this study and also lower than the interrater agreement (77%) reported by Ma and Yiu (2005). This difference is likely due to the fine granularity of our agreement calculation, which scored agreement over fractional segments of a second, whereas other studies assessed agreement across entire samples. To further examine this discrepancy, we evaluated a subset of 86 of the 124 samples in which all three experts assigned a single signal type across the entire 1-s sample. Within this subset, interrater exact agreement reached 80.2%, similar to the agreement reported by Ma and Yiu (2005), supporting our speculation that segment-level signal typing introduces greater variability among experts due to its finer temporal resolution.
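Fine-grained agreement of this kind can be scored by comparing raters' labels in small time frames rather than per whole sample. The sketch below assumes an illustrative (start, end, type) segment-label format and a 10-ms scoring resolution; neither is the study's exact procedure:

```python
def frame_agreement(rater_a, rater_b, frame=0.01):
    """Exact percent agreement between two raters' segment labelings,
    scored frame by frame. Each labeling is a list of
    (start_s, end_s, type) tuples covering the sample; `frame` is the
    scoring resolution in seconds."""
    def label_at(segs, t):
        for s, e, lab in segs:
            if s <= t < e:
                return lab
        return None
    end = max(e for _, e, _ in rater_a)
    n = int(round(end / frame))
    # Score each frame at its midpoint to avoid boundary ambiguity.
    hits = sum(
        label_at(rater_a, (i + 0.5) * frame) == label_at(rater_b, (i + 0.5) * frame)
        for i in range(n)
    )
    return 100.0 * hits / n
```

Two raters who agree on the first half of a 1-s sample but place a Type 1/Type 2 boundary 0.25 s apart would score 75% under this scheme, even though both assigned the same two types overall.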

When disagreements occurred among experts, the difference was typically one signal type, and only a fraction of one sample showed a difference of two signal types. Interrater reliability was higher for GVS samples than for SGVS samples, potentially because SGVS samples contained mostly Type 2 and Type 3 signals. In fact, one-type differences found in SGVS samples were mostly between Type 2 and Type 3, consistent with Behrman et al. (1998), who reported that 40% of Type 2 and Type 3 samples were difficult to distinguish. This limitation of manual signal typing supports the value of developing objective signal typing methods. However, despite the lower exact agreement for SGVS voices, the signal type of the samples used for predictive model development was ultimately determined by consensus among the three SLPs, increasing the validity of the typing results.

The results of the consensus signal type labeling from the three SLPs revealed that over 10% of the voice samples and 20% of the SGVS voice samples contained segments with different signal types, despite being only 1 s in length. The greater fragmentation of signal types in SGVS compared to GVS voices (9%) may be attributed to the less-stable voice production inherent in supraglottal structure use, as opposed to the true vocal fold vibration in GVS voices. Consistent with previous studies (Kelchner, Weinrich, et al., 2010; Zacharias et al., 2016), SGVS voices exhibited a higher prevalence of Type 2 and Type 3 signals, indicating greater overall severity. This further supports the notion that measures dependent on accurate F0 estimation may be less useful in this population (Baker et al., 2006; Baker Brehm et al., 2009). To support the assessment of SGVS voices, segment labeling of signal types allows for the calculation of signal type proportions within a recorded signal, which can potentially serve as valuable outcome measures for voice treatment. Changes in the proportions of Type 1 have demonstrated greater sensitivity to pre–post intervention effects compared to perturbation measures and CPPS (Kopf et al., 2017; Liu et al., 2023). Furthermore, automated signal typing algorithms can enhance the objectivity of these proportion measures.

Development of a Predictive Model for Signal Typing

Using signal-type–labeled segmented samples, this study also aimed to develop a predictive model for signal type based on measures associated with periodicity and voice quality. The logistic regression model incorporating CPPS, EnvSD8, and sharpness achieved strong prediction accuracy across all signal types in the test set, as validated by fivefold cross-validation. This high accuracy across all types suggests an improvement over the VTC analysis using the diffusive chaos method (Liu et al., 2019), which estimates the proportion of each VTC within a signal. Although Liu et al. (2019) observed significant differences in VTC1 proportions across signal types, their method did not significantly differentiate between Type 2 and Type 3 signals based on any VTC measure.

Our model also achieved higher overall accuracy in cross-validation (92%) compared to the approximately 83% reported by Miramont et al. (2022), who used a support vector machine trained on selected features such as shimmer, HNR, CPPS, and nonlinear dynamic measures. Notably, our model performed especially well in classifying Type 1 signals, achieving higher accuracy than their reported 86% for Type 1 classification. In both studies, misclassified Type 1 samples were categorized as Type 2; however, our model produced fewer of these misclassifications, suggesting improved differentiation between Type 1 and Type 2 signals. This distinction between Type 1 and Type 2 is particularly important, given that one of the major purposes of signal typing is to validate the use of perturbation measures, which are considered reliable only for Type 1 signals (Titze, 1995).

A potential reason our predictive model was able to distinguish more effectively between Type 1 and Type 2, as well as between Type 2 and Type 3 signals, is the use of labeled segments that contained only a single signal type, as determined by consensus of SLPs. This contrasts with the 1-s vowel segments used by Liu et al. (2019) and the entire vowel samples used by Miramont et al. (2022), both of which may have contained mixed distributions of signal types even within the sample. Additionally, our measures did not include F0-dependent perturbation measures, as used in Miramont et al. (2022), and this may have contributed to the improved accuracy in Type 1 classification. Thus, as hypothesized, this study supports the use of signal type labeling and measures strongly associated with signal periodicity and voice quality for predicting signal type.

In our prediction model, EnvSD8, which demonstrated the highest Wald value, appeared to be the strongest contributing predictor. EnvSD8 reflects the degree of amplitude modulation around a center frequency of 108 Hz, as perceived by the auditory system (Park et al., 2023). This measure has been strongly associated with perceived roughness in pediatric voices, as amplitude modulation within this low-frequency range is related to roughness perception (Fastl & Zwicker, 2007). Given the generally higher F0 in pediatric voices typically above 200 Hz (Maturo et al., 2012; Nicollas et al., 2008), the 108-Hz modulation frequency in temporal amplitude may also reflect subharmonics, which are strongly associated with perceived vocal roughness (Awan & Awan, 2020; Kramer et al., 2013; Latoszek et al., 2018). Amplitude modulation at low modulation frequencies has been associated with the presence of subharmonics (Omori et al., 1997), and amplitude modulation has been used to create subharmonics in previous studies (Bergan & Titze, 2001; Sakakibara et al., 2011). In fact, Type 2 signals are defined by the presence of subharmonics. Therefore, EnvSD8, potentially associated with the degree of subharmonics in pediatric voices, may contribute to signal typing by reflecting subharmonic content. Similarly, for adult voices, EnvSD7 (center frequency = 67 Hz) might be a useful predictor of signal type, considering their lower F0.
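The envelope-modulation logic behind EnvSD8 can be illustrated with a rough sketch: extract the temporal envelope, band-pass it around the 108-Hz modulation frequency, and take the standard deviation of the filtered envelope. The Hilbert-envelope and the one-octave band are our simplifications; the auditory-model filterbank of Park et al. (2023) is not reproduced here:

```python
import numpy as np

def env_sd(signal, fs, fc=108.0, bw_octaves=1.0):
    """Illustrative envelope-SD measure around modulation frequency fc."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    # Analytic signal via the FFT-based Hilbert transform.
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    env = np.abs(np.fft.ifft(X * h))
    # Band-pass the envelope around fc in the frequency domain.
    E = np.fft.rfft(env - env.mean())
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    lo, hi = fc / 2 ** (bw_octaves / 2), fc * 2 ** (bw_octaves / 2)
    E[(freqs < lo) | (freqs > hi)] = 0.0
    band = np.fft.irfft(E, n)
    # SD of the modulation-band envelope, normalized by envelope level.
    return float(band.std() / env.mean())
```

A tone amplitude-modulated at 108 Hz (analogous to a rough or subharmonic-rich voice) yields a much larger value than an unmodulated tone of the same carrier frequency.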

CPPS, which exhibited a slightly lower Wald value than EnvSD8, reflects the degree of periodicity in the signal, a major attribute in determining signal type. Because of this association with signal periodicity, CPPS has been strongly associated with overall dysphonia severity and breathiness. CPPS is incorporated into many multivariate models of overall severity, such as the Acoustic Voice Quality Index (Maryn et al., 2010) and the Cepstral Spectral Index of Dysphonia (Awan et al., 2016), which have demonstrated significant prediction of voice severity. Thus, it is consistent that CPPS also contributes to the prediction of signal type, as higher signal types are associated with increased dysphonia and aperiodicity.
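The periodicity logic of the cepstral peak can be illustrated with a bare-bones computation: the height of the cepstral peak, above a linear regression baseline, within the quefrency range corresponding to plausible F0 values. CPPS as used in the study additionally smooths the cepstrum across time and quefrency, which this sketch omits; the function name and F0 search range are illustrative:

```python
import numpy as np

def cpp(x, fs, f0_range=(60.0, 880.0)):
    """Simplified cepstral peak prominence for one analysis frame."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Log magnitude spectrum of the windowed frame.
    log_mag = np.log10(np.abs(np.fft.rfft(x * np.hanning(n))) + 1e-12)
    ceps = np.fft.irfft(log_mag)            # real cepstrum
    q = np.arange(len(ceps)) / fs           # quefrency axis in seconds
    # Search quefrencies corresponding to the plausible F0 range.
    lo, hi = int(fs / f0_range[1]), int(fs / f0_range[0])
    # Prominence = peak height above a linear regression baseline.
    coef = np.polyfit(q[lo:hi], ceps[lo:hi], 1)
    return float(np.max(ceps[lo:hi] - np.polyval(coef, q[lo:hi])))
```

A periodic pulse train produces a pronounced cepstral peak at the quefrency of its period, whereas noise yields only small fluctuations, mirroring the link between CPPS and periodicity described above.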

Sharpness, a perceptual quality associated with the spectral distribution of the signal, also contributed to the high accuracy of the prediction model. Sharpness is related to the prominence of mid-to-high frequency content compared to low frequency content (Fastl & Zwicker, 2007). Due to these characteristics, sharpness has been strongly associated with perceived vocal strain (Anand et al., 2019; Park et al., 2023), a quality attributed to increased vocal effort, which can increase mid-to-high frequency spectral energy (Cantor-Cutiva et al., 2023; Liénard, 2019; McKenna & Stepp, 2018). Spectral distribution has also been associated with overall dysphonia severity (Awan et al., 2016; Maryn et al., 2010), as dysphonic voices are likely to include more mid-to-high frequency energy caused by increased vocal effort and aspiration noise from incomplete glottal closure (de Krom, 1995; Klatt & Klatt, 1990; Lowell et al., 2012). Given that higher signal types are observed in more severe dysphonic voices, sharpness estimates could aid in predicting higher signal types as observed in this study.

The inclusion of sharpness in this model seemed particularly to improve the accuracy of Type 2 signal prediction, based on a comparison of the test set results for the model with CPPS and the model with pitch strength. The two models differed in the presence of pitch strength versus CPPS and in the inclusion of sharpness. Because of the strong correlation between CPPS and pitch strength, the improved accuracy in predicting Type 2 signals is likely attributable to the inclusion of sharpness. More detailed analysis of the models' predictions revealed that the model without sharpness miscategorized Type 2 signals as Type 1, whereas the model with sharpness more accurately categorized these signals as Type 2. Thus, accounting for spectral content improved the ability to distinguish Type 2 signals from Type 1 signals in this study, likely due to the increased mid-to-high frequency spectral content present in Type 2 signals.

Despite the superior overall performance of the model with CPPS, pitch strength still demonstrated significant potential as a predictor of signal type. As discussed earlier, the difference in performance between the two models is likely due to the exclusion of sharpness from the pitch strength model. The reason why sharpness was initially found to be nonsignificant in the pitch strength model but was significant when included in the model with CPPS remains unclear. This discrepancy may stem from subtle differences between pitch strength and CPPS, despite their strong correlation. Awan et al. (2024) observed that while both pitch strength and CPPS strongly correlated with overall severity and perceived breathiness, pitch strength showed moderate significant correlations with strain and sharpness, whereas CPPS did not. This stronger correlation between pitch strength and sharpness may explain why sharpness was excluded from the initial pitch strength model. Nonetheless, the model with pitch strength demonstrated 76% accuracy in Type 1 prediction and 83% accuracy in Type 3 prediction. These results are consistent with previous studies in which pitch strength strongly differentiated signal types in both adult and pediatric voices (Anand et al., 2021; Park et al., 2023). Therefore, pitch strength warrants further investigation as a potential predictor of signal type.

Clinical Applications and Future Directions

The manual signal typing tool with segment labeling developed in this study can support signal typing prior to applying perturbation and F0-based measures, which remain widely used in clinical research (Mitsuhashi et al., 2025; Petrizzo & Popolo, 2021; Saeedi et al., 2023; Toptas et al., 2025; Yam et al., 2025). Despite existing recommendations regarding signal typing and the validity of perturbation measures, many studies still do not report signal types. Incorporating signal typing can improve the validity of these measures in both research and clinical contexts.

The manual typing program could be further developed to provide acoustic measures based on signal type labels. Since the accuracy of perturbation and F0-based measures declines in the presence of Type 2 and Type 3 signals, labeling segments within a signal allows for the selective extraction of F0-based measures from Type 1 portions only. This approach could enhance the validity of these measures if higher type portions are intermittently present in otherwise predominantly Type 1 signals.

For individuals with voices that contain signal types above Type 2, alternative voice quality measures that are less dependent on accurate F0 tracking should be considered. For such cases, this tool may support the use of signal type proportion measures, particularly relevant for SGVS populations where Type 2 and Type 3 signals are prevalent (Kelchner, Weinrich, et al., 2010; Zacharias et al., 2016). The aperiodicity in the voice samples from these populations interferes with accurate F0 tracking, which limits the validity and usefulness of waveform cycle–based measures (Baker et al., 2006; Baker Brehm et al., 2009). Signal type proportion measures derived from manual signal type labeling may offer a promising alternative, although their objectivity remains limited without automation. A higher proportion of Type 2 and Type 3 signals may indicate greater severity, whereas a reduction in the proportion of higher signal types may serve as a useful indicator of treatment progress and improved voice quality. Similar measures, such as the number of estimated Type 1 windows (Kopf et al., 2017) and VTC1 (Liu et al., 2023), both of which reflect the proportions of Type 1–like signals, have increased following successful voice therapy in adult populations, consistent with increased periodicity. Therefore, signal type proportion measures from segment-level labeling may be equally useful for assessing adult dysphonia.
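Given segment labels from the manual tool, signal type proportions reduce to duration ratios. A minimal sketch, assuming (start_s, end_s, type) tuples as the label format:

```python
def type_proportions(segments, types=(1, 2, 3)):
    """Proportion of the labeled duration occupied by each signal type,
    given (start_s, end_s, type) segment labels for one recording."""
    durations = {t: 0.0 for t in types}
    for start, end, t in segments:
        durations[t] += end - start
    total = sum(durations.values())
    # Normalize so the proportions sum to 1 across signal types.
    return {t: d / total for t, d in durations.items()}
```

For example, a 1-s sample labeled Type 1 for 0.6 s, Type 2 for 0.2 s, and Type 3 for 0.2 s yields proportions of 0.6, 0.2, and 0.2; a post-treatment shift of that distribution toward Type 1 would register directly in these values.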

To further enhance the utility of this tool, future development should focus on incorporating time-segmented automated signal typing. Such an approach would automate signal type labeling within voice signals, enabling objective validation of F0-based measures and providing quantifiable signal type proportion measures. Future research can build on the measures investigated in this study to develop an automated algorithm for generating a time-dependent signal typing vector, which will require optimization of window size, feature selection, and validation against labeled data. Time-segmented automated signal typing could also be extended to connected speech or sentence-level stimuli, which would further support the assessment of pediatric and adult dysphonia.
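A time-dependent signal typing vector of the kind proposed here could be generated by sliding an analysis window over the recording and classifying each window. In the sketch below, `features_fn` and `classify` are hypothetical stand-ins for the feature extraction and a fitted model, and the window and hop sizes are arbitrary placeholders pending the optimization noted above:

```python
def typing_vector(features_fn, signal, fs, classify, win_s=0.25, hop_s=0.05):
    """Classify overlapping windows of a recording, returning one
    predicted signal type per window position."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    out = []
    for start in range(0, len(signal) - win + 1, hop):
        # Extract features for this window and classify its type.
        out.append(classify(features_fn(signal[start:start + win])))
    return out
```

The resulting vector supports both uses discussed above: restricting F0-based measures to windows classified as Type 1, and computing signal type proportions over the whole recording.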

Limitations

The predictive model developed in this study was trained on labeled portions of voice signals segmented so that they contain only a single, consistent signal type per voice segment. As such, the model in its current form may not perform optimally on signals that contain multiple signal types. However, this design choice was intentional; training on uniform data was intended to improve the model's ability to accurately characterize each signal type by eliminating the confounding influence of mixed signals. The strong accuracy observed in this model supports the success of this approach and the relevance of the selected parameters to signal typing. The knowledge gained from this study can serve as a foundation for more refined automated signal typing algorithms, capable of analyzing signal type within moving time windows to generate a signal typing vector, as discussed earlier. Future development of such algorithms could benefit from integration with more advanced machine learning or artificial intelligence methods.

Our study also did not account for the possibility of Type 4 signals within Type 3 signals, and as a result, the developed model does not distinguish between them. This exclusion was intentional, as the clinical value of distinguishing Type 3 from Type 4 was considered limited compared to the additional difficulty it would pose for the manual signal typing task. For validation of perturbation measures, distinguishing Type 1 from higher signal types is most important, and for assessing SGVS voices, given that Type 3 signals already indicate severe dysphonia (Deborah & Samayan, 2024; Lopes et al., 2020), the proportions of Type 2 and Type 3 were considered sufficient to represent varying levels of dysphonia severity. Additionally, Deborah and Samayan (2024), in a study of 457 Indian school-aged children (ages 4–16 years), reported that they were unable to identify any Type 4 signals in sustained /ɑ/ recordings. Similarly, our samples exhibited a relatively low proportion of Type 3 signals, and including Type 4 would have further reduced the sample size for Type 3, potentially compromising model training and classification accuracy.

Conclusion

This study used a manual signal typing tool to facilitate the labeling of different signal types within voice samples and developed a predictive model of signal typing using signal type–labeled samples. Pediatric voices with GVS and SGVS were examined using the manual signal typing tool to support the evaluation of pediatric voice, especially in children with SGVS, for whom adequate measures are lacking. The labeling tool demonstrated potential utility, as portions of our voice samples, particularly SGVS voices, exhibited two or more signal types even within 1-s vowel samples. The predictive model developed with CPPS, EnvSD8, and sharpness exhibited high accuracy across signal types. The measures investigated in this study have the potential to be used in further refinement of automated signal typing tools, which could facilitate objective signal typing and potentially provide signal type proportion measures for clinical use.

Data Availability Statement

The published data are available from the corresponding author upon reasonable request.

Acknowledgments

This work was supported by National Institutes of Health Grant NIH R01 DC018008 (awarded to David A. Eddins, Alessandro de Alarcon, and Rahul Shrivastav). The authors would like to thank Kevin McElfresh for assistance with the voice sample collection and Mark Skowronski and Erol Ozmeral for contributions to software development.

Funding Statement

This work was supported by National Institutes of Health Grant NIH R01 DC018008 (awarded to David A. Eddins, Alessandro de Alarcon, and Rahul Shrivastav).

References

  1. Anand, S., Kopf, L. M., Shrivastav, R., & Eddins, D. A. (2019). Objective indices of perceived vocal strain. Journal of Voice, 33(6), 838–845. 10.1016/j.jvoice.2018.06.005 [DOI] [PubMed] [Google Scholar]
  2. Anand, S., Kopf, L. M., Shrivastav, R., & Eddins, D. A. (2021). Using pitch height and pitch strength to characterize type 1, 2, and 3 voice signals. Journal of Voice, 35(2), 181–193. 10.1016/j.jvoice.2019.08.006 [DOI] [PubMed] [Google Scholar]
  3. Angelillo, N., Di Costanzo, B., Angelillo, M., Costa, G., Barillari, M. R., & Barillari, U. (2008). Epidemiological study on vocal disorders in paediatric age. Journal of Preventive Medicine and Hygiene, 49(1), 1–5. https://www.ncbi.nlm.nih.gov/pubmed/18792527 [PubMed] [Google Scholar]
  4. Awan, S. N., & Awan, J. A. (2020). A two-stage cepstral analysis procedure for the classification of rough voices. Journal of Voice, 34(1), 9–19. 10.1016/j.jvoice.2018.07.003 [DOI] [PubMed] [Google Scholar]
  5. Awan, S. N., Park, Y., Anand, S., Shrivastav, R., & Eddins, D. A. (2024, May 29–June 1). The relationship between cepstral peak prominence and pitch strength in complex dysphonia [Paper presentation]. 53rd Annual Voice Foundation Symposium, Philadelphia, PA, United States. [Google Scholar]
  6. Awan, S. N., Roy, N., Zhang, D., & Cohen, S. M. (2016). Validation of the Cepstral Spectral Index of Dysphonia (CSID) as a screening tool for voice disorders: Development of clinical cutoff scores. Journal of Voice, 30(2), 130–144. 10.1016/j.jvoice.2015.04.009 [DOI] [PubMed] [Google Scholar]
  7. Baker, S., Kelchner, L., Weinrich, B., Lee, L., Willging, P., Cotton, R., & Zur, K. (2006). Pediatric laryngotracheal stenosis and airway reconstruction: A review of voice outcomes, assessment, and treatment issues. Journal of Voice, 20(4), 631–641. 10.1016/j.jvoice.2005.08.012 [DOI] [PubMed] [Google Scholar]
  8. Baker Brehm, S., Weinrich, B., Zieser, M., Kelchner, L., Middendorf, J., Elluru, R., & de Alarcon, A. (2009). Aerodynamic and acoustic assessment in children following airway reconstruction: An assessment of feasibility. International Journal of Pediatric Otorhinolaryngology, 73(7), 1019–1023. 10.1016/j.ijporl.2009.04.001 [DOI] [PubMed] [Google Scholar]
  9. Behrman, A., Agresti, C. J., Blumstein, E., & Lee, N. (1998). Microphone and electroglottographic data from dysphonic patients: Type 1, 2 and 3 signals. Journal of Voice, 12(2), 249–260. 10.1016/s0892-1997(98)80045-3 [DOI] [PubMed] [Google Scholar]
  10. Bergan, C. C., & Titze, I. R. (2001). Perception of pitch and roughness in vocal signals with subharmonics. Journal of Voice, 15(2), 165–175. 10.1016/S0892-1997(01)00018-2 [DOI] [PubMed] [Google Scholar]
  11. Boersma, P., & Weenink, D. (2021). Praat: Doing phonetics by computer (Version 6.1.50) [Computer software]. http://www.praat.org/
  12. Camacho, A. (2012, July 2–5). On the use of auditory models' elements to enhance a sawtooth waveform inspired pitch estimator on telephone-quality signals [Paper presentation]. 11th international Conference on Information Science, Signal Processing and their Applications (ISSPA), Montreal, QC, Canada. 10.1109/ISSPA.2012.6310450 [DOI] [Google Scholar]
  13. Cantor-Cutiva, L. C., Calvache Mora, C., Rincon-Pinilla, M. P., & Hunter, E. J. (2023). Association between bilingualism, vocal effort, and background noise with voice fundamental frequency and LTAS among Spanish-English bilingual teachers. Journal of Voice. Advance online publication. 10.1016/j.jvoice.2023.09.001 [DOI] [Google Scholar]
  14. Carding, P. N., Roulstone, S., Northstone, K., & ALSPAC Study Team. (2006). The prevalence of childhood dysphonia: A cross-sectional study. Journal of Voice, 20(4), 623–630. 10.1016/j.jvoice.2005.07.004 [DOI] [PubMed] [Google Scholar]
  15. Clary, R. A., Pengilly, A., Bailey, M., Jones, N., Albert, D., Comins, J., & Appleton, J. (1996). Analysis of voice outcomes in pediatric patients following surgical procedures for laryngotracheal stenosis. Archives of Otolaryngology—Head & Neck Surgery, 122(11), 1189–1194. 10.1001/archotol.1996.01890230035008 [DOI] [PubMed] [Google Scholar]
  16. Dau, T., Kollmeier, B., & Kohlrausch, A. (1997). Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. The Journal of the Acoustical Society of America, 102(5), 2892–2905. 10.1121/1.420344 [DOI] [PubMed] [Google Scholar]
  17. Deborah, R., & Samayan, K. (2024). Signal typing and acoustic analysis of voice in school-aged Indian children. Journal of Voice. Advance online publication. 10.1016/j.jvoice.2024.09.024 [DOI] [Google Scholar]
  18. de Krom, G. (1995). Some spectral correlates of pathological breathy and rough voice quality for different types of vowel fragments. Journal of Speech and Hearing Research, 38(4), 794–811. 10.1044/jshr.3804.794 [DOI] [PubMed] [Google Scholar]
  19. Eddins, D. A., Anand, S., Camacho, A., & Shrivastav, R. (2016). Modeling of breathy voice quality using pitch-strength estimates. Journal of Voice, 30(6), 774.e1–774.e7. 10.1016/j.jvoice.2015.11.016 [DOI] [Google Scholar]
  20. Eddins, D. A., Kopf, L. M., & Shrivastav, R. (2015). The psychophysics of roughness applied to dysphonic voice. The Journal of the Acoustical Society of America, 138(6), 3820–3825. 10.1121/1.4937753 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Fastl, H., & Zwicker, E. (2007). Psychoacoustics: Facts and models (3rd ed.). Springer. 10.1007/978-3-540-68888-4 [DOI] [Google Scholar]
  22. Fujiki, R. B., & Thibeault, S. L. (2024). Voice disorder prevalence and vocal health characteristics in children. JAMA Otolaryngology–Head & Neck Surgery, 150(8), 677–687. 10.1001/jamaoto.2024.1516 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Heller Murray, E. S., Chao, A., & Colletti, L. (2025). A practical guide to calculating cepstral peak prominence in Praat. Journal of Voice, 39(2), 365–370. 10.1016/j.jvoice.2022.09.002 [DOI] [PubMed] [Google Scholar]
  24. Heman-Ackah, Y. D., Heuer, R. J., Michael, D. D., Ostrowski, R., Horman, M., Baroody, M. M., Hillenbrand, J., & Sataloff, R. T. (2003). Cepstral peak prominence: A more reliable measure of dysphonia. Annals of Otology, Rhinology & Laryngology, 112(4), 324–333. 10.1177/000348940311200406 [DOI] [PubMed] [Google Scholar]
  25. Hillenbrand, J., Cleveland, R. A., & Erickson, R. L. (1994). Acoustic correlates of breathy vocal quality. Journal of Speech and Hearing Research, 37(4), 769–778. 10.1044/jshr.3704.769 [DOI] [PubMed] [Google Scholar]
  26. Hillman, R. E., Holmberg, E. B., Perkell, J. S., Walsh, M., & Vaughan, C. (1989). Objective assessment of vocal hyperfunction: An experimental framework and initial results. Journal of Speech and Hearing Research, 32(2), 373–392. 10.1044/jshr.3202.373 [DOI] [PubMed] [Google Scholar]
  27. Johnson, C. M., Anderson, D. C., & Brigger, M. T. (2020). Pediatric dysphonia: A cross-sectional survey of subspecialty and primary care clinics. Journal of Voice, 34(2), 301.e1–301.e5. 10.1016/j.jvoice.2018.08.017 [DOI] [Google Scholar]
  28. Johnston, R., Jones, K., & Manley, D. (2018). Confounding and collinearity in regression analysis: A cautionary tale and an alternative procedure, illustrated by studies of British voting behaviour. Quality & Quantity, 52(4), 1957–1976. 10.1007/s11135-017-0584-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Kelchner, L. N., Brehm, S. B., Weinrich, B., Middendorf, J., deAlarcon, A., Levin, L., & Elluru, R. (2010). Perceptual evaluation of severe pediatric voice disorders: Rater reliability using the Consensus Auditory Perceptual Evaluation of Voice. Journal of Voice, 24(4), 441–449. 10.1016/j.jvoice.2008.09.004 [DOI] [PubMed] [Google Scholar]
  30. Kelchner, L. N., Weinrich, B., Baker Brehm, S., Tabangin, M. E., & de Alarcon, A. (2010). Characterization of supraglottic phonation in children after airway reconstruction. Annals of Otology, Rhinology & Laryngology, 119(6), 383–390. 10.1177/000348941011900604
  31. Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), 820–857. 10.1121/1.398894
  32. Kopf, L. M., Jackson-Menaldi, C., Rubin, A. D., Skeffington, J., Hunter, E. J., Skowronski, M. D., & Shrivastav, R. (2017). Pitch strength as an outcome measure for treatment of dysphonia. Journal of Voice, 31(6), 691–696. 10.1016/j.jvoice.2017.01.016
  33. Kramer, E., Linder, R., & Schönweiler, R. (2013). A study of subharmonics in connected speech material. Journal of Voice, 27(1), 29–38. 10.1016/j.jvoice.2012.08.005
  34. Krival, K., Kelchner, L. N., Weinrich, B., Baker, S. E., Lee, L., Middendorf, J. H., & Zur, K. B. (2007). Vibratory source, vocal quality and fundamental frequency following pediatric laryngotracheal reconstruction. International Journal of Pediatric Otorhinolaryngology, 71(8), 1261–1269. 10.1016/j.ijporl.2007.04.018
  35. Latoszek, B. B. v., Maryn, Y., Gerrits, E., & De Bodt, M. (2018). A meta-analysis: Acoustic measurement of roughness and breathiness. Journal of Speech, Language, and Hearing Research, 61(2), 298–323. 10.1044/2017_JSLHR-S-16-0188
  36. Lechien, J. R., Geneid, A., Bohlender, J. E., Cantarella, G., Avellaneda, J. C., Desuter, G., Sjogren, E. V., Finck, C., Hans, S., Hess, M., Oguz, H., Remacle, M. J., Schneider-Stickler, B., Tedla, M., Schindler, A., Vilaseca, I., Zabrodsky, M., Dikkers, F. G., & Crevier-Buchman, L. (2023). Consensus for voice quality assessment in clinical practice: Guidelines of the European Laryngological Society and Union of the European Phoniatricians. European Archives of Oto-Rhino-Laryngology, 280(12), 5459–5473. 10.1007/s00405-023-08211-6
  37. Liénard, J. S. (2019). Quantifying vocal effort from the shape of the one-third octave long-term-average spectrum of speech. The Journal of the Acoustical Society of America, 146(4), EL369–EL375. 10.1121/1.5129677
  38. Liu, B., Hao, G., Cui, Y., Fang, J., Ji, M., Wu, J., Jiang, J., Shao, J., & Liu, H. (2023). Introduction of voice type component (VTC) as an effective acoustic voice analysis method in tele-evaluation. Journal of Voice. Advance online publication. 10.1016/j.jvoice.2023.07.005
  39. Liu, B., Polce, E., & Jiang, J. (2018). Application of local intrinsic dimension for acoustical analysis of voice signal components. Annals of Otology, Rhinology & Laryngology, 127(9), 588–597. 10.1177/0003489418780439
  40. Liu, B., Polce, E., Raj, H., & Jiang, J. (2019). Quantification of voice type components present in human phonation using a modified diffusive chaos technique. Annals of Otology, Rhinology & Laryngology, 128(10), 921–931. 10.1177/0003489419848451
  41. Lopes, L. W., da Silva, I. M., da Silva Sousa, E. S., da Silva, A. C. F., de Paiva, M. A. A., Diniz, E. G. R., & Silva, P. O. C. (2020). Classificação espectrográfica do sinal vocal: Relação com o diagnóstico laríngeo e a análise perceptivo-auditiva [Spectrographic classification of the vocal signal: Relation with laryngeal diagnosis and auditory-perceptual analysis]. Communication Research, 25, Article e2194. 10.1590/2317-6431-2019-2194
  42. Lowell, S. Y., Kelley, R. T., Awan, S. N., Colton, R. H., & Chan, N. H. (2012). Spectral- and cepstral-based acoustic features of dysphonic, strained voice quality. Annals of Otology, Rhinology & Laryngology, 121(8), 539–548. 10.1177/000348941212100808
  43. Ma, E. P.-M., & Yiu, E. M.-L. (2005). Suitability of acoustic perturbation measures in analysing periodic and nearly periodic voice signals. Folia Phoniatrica et Logopaedica, 57(1), 38–47. 10.1159/000081960
  44. Martins, R. H. G., Hidalgo Ribeiro, C. B., Fernandes de Mello, B. M. Z., Branco, A., & Tavares, E. L. M. (2012). Dysphonia in children. Journal of Voice, 26(5), 674.e17–674.e20. 10.1016/j.jvoice.2012.03.004
  45. Maryn, Y., De Bodt, M., & Roy, N. (2010). The Acoustic Voice Quality Index: Toward improved treatment outcomes assessment in voice disorders. Journal of Communication Disorders, 43(3), 161–174. 10.1016/j.jcomdis.2009.12.004
  46. Maturo, S., Hill, C., Bunting, G., Ballif, C., Maurer, R., & Hartnick, C. (2012). Establishment of a normative pediatric acoustic database. Archives of Otolaryngology—Head & Neck Surgery, 138(10), 956–961. 10.1001/2013.jamaoto.104
  47. McKenna, V. S., & Stepp, C. E. (2018). The relationship between acoustical and perceptual measures of vocal effort. The Journal of the Acoustical Society of America, 144(3), 1643–1658. 10.1121/1.5055234
  48. Mehta, D. D., & Hillman, R. E. (2008). Voice assessment: Updates on perceptual, acoustic, aerodynamic, and endoscopic imaging methods. Current Opinion in Otolaryngology & Head and Neck Surgery, 16(3), 211–215. 10.1097/MOO.0b013e3282fe96ce
  49. Merati, A. L., Keppel, K., Braun, N. M., Blumin, J. H., & Kerschner, J. E. (2008). Pediatric Voice-Related Quality of Life: Findings in healthy children and in common laryngeal disorders. Annals of Otology, Rhinology & Laryngology, 117(4), 259–262. 10.1177/000348940811700404
  50. Miramont, J. M., Restrepo, J. F., Codino, J., Jackson-Menaldi, C., & Schlotthauer, G. (2022). Voice signal typing using a pattern recognition approach. Journal of Voice, 36(1), 34–42. 10.1016/j.jvoice.2020.03.006
  51. Mitsuhashi, T., Umeno, H., Chitose, S. I., Kurita, T., Furukawa, K., Sato, F., Sato, K., Fukahori, M., Ono, T., Sueyoshi, S., & Hamakawa, S. (2025). Comparative analysis of preoperative and postoperative voice parameters in 93 cases of vocal fold cyst. Journal of Voice. Advance online publication. 10.1016/j.jvoice.2025.03.026
  52. Nicollas, R., Garrel, R., Ouaknine, M., Giovanni, A., Nazarian, B., & Triglia, J. M. (2008). Normal voice in children between 6 and 12 years of age: Database and nonlinear analysis. Journal of Voice, 22(6), 671–675. 10.1016/j.jvoice.2007.01.009
  53. Omori, K., Kojima, H., Kakani, R., Slavit, D. H., & Blaugrund, S. M. (1997). Acoustic characteristics of rough voice: Subharmonics. Journal of Voice, 11(1), 40–47. 10.1016/s0892-1997(97)80022-7
  54. Park, Y., Anand, S., Baker Brehm, S., Kelchner, L., Weinrich, B., McElfresh, K., Shrivastav, R., de Alarcon, A., & Eddins, D. A. (2025). Multidimensional perceptual structure of pediatric dysphonia in children with glottal and supraglottal vibratory sources [Manuscript submitted for publication].
  55. Park, Y., Anand, S., Gifford, S. M., Shrivastav, R., & Eddins, D. A. (2023). Development and validation of a single-variable comparison stimulus for matching strained voice quality using a psychoacoustic framework. Journal of Speech, Language, and Hearing Research, 66(1), 16–29. 10.1044/2022_JSLHR-22-00280
  56. Park, Y., Anand, S., Ozmeral, E. J., Shrivastav, R., & Eddins, D. A. (2022). Predicting perceived vocal roughness using a bio-inspired computational model of auditory temporal envelope processing. Journal of Speech, Language, and Hearing Research, 65(8), 2748–2758. 10.1044/2022_JSLHR-22-00101
  57. Park, Y., Baker Brehm, S., Kelchner, L., Weinrich, B., McElfresh, K., Anand, S., Shrivastav, R., de Alarcon, A., & Eddins, D. A. (2023). Effects of vibratory source on auditory-perceptual and bio-inspired computational measures of pediatric voice quality. Journal of Voice. Advance online publication. 10.1016/j.jvoice.2023.08.016
  58. Patel, R. R., Awan, S. N., Barkmeier-Kraemer, J., Courey, M., Deliyski, D., Eadie, T., Paul, D., Svec, J. G., & Hillman, R. (2018). Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function. American Journal of Speech-Language Pathology, 27(3), 887–905. 10.1044/2018_AJSLP-17-0009
  59. Petrizzo, D., & Popolo, P. S. (2021). Smartphone use in clinical voice recording and acoustic analysis: A literature review. Journal of Voice, 35(3), 499.e23–499.e28. 10.1016/j.jvoice.2019.10.006
  60. Saeedi, S., Dabirmoghaddam, P., Soleimani, M., & Aghajanzadeh, M. (2023). Relationship among five-factor personality traits and psychological distress with acoustic analysis. Laryngoscope Investigative Otolaryngology, 8(4), 996–1006. 10.1002/lio2.1119
  61. Sakakibara, K., Imagawa, H., Yokonishi, H., Kimura, M., & Tayama, N. (2011, October 18–21). Physiological observations and synthesis of subharmonic voices [Paper presentation]. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Xi'an, China. http://www.apsipa.org/proceedings_2011/pdf/APSIPA311.pdf
  62. Shrivastav, R., Camacho, A., Patel, S., & Eddins, D. A. (2011). A model for the prediction of breathiness in vowels. The Journal of the Acoustical Society of America, 129(3), 1605–1615. 10.1121/1.3543993
  63. Smith, M. E., Marsh, J. H., Cotton, R. T., & Myer, C. M., III. (1993). Voice problems after pediatric laryngotracheal reconstruction: Videolaryngostroboscopic, acoustic, and perceptual assessment. International Journal of Pediatric Otorhinolaryngology, 25(1–3), 173–181. 10.1016/0165-5876(93)90051-4
  64. Sprecher, A., Olszewski, A., Jiang, J. J., & Zhang, Y. (2010). Updating signal typing in voice: Addition of type 4 signals. The Journal of the Acoustical Society of America, 127(6), 3710–3716. 10.1121/1.3397477
  65. Titze, I. R. (1995). Workshop on acoustic voice analysis: Summary statement. National Center for Voice and Speech. https://ncvs.org/archive/freebooks/summary-statement.pdf
  66. Toptas, G., Tatar, E. C., Keseroglu, K., Sahin, M., Barmak, E., Altan, E., & Korkmaz, M. H. (2025). Does voice therapy after endolaryngeal microsurgery contribute to the treatment of vocal fold polyp? Folia Phoniatrica et Logopaedica, 77(5), 495–505. 10.1159/000545213
  67. Weinrich, B., Baker, S., Kelchner, L., Middendorf, J., Krival, K., Elluru, R., & Zur, K. B. (2007). Examination of aerodynamic measures and strain by vibratory source. Otolaryngology—Head and Neck Surgery, 136(3), 455–458. 10.1016/j.otohns.2006.11.052
  68. Yam, C., Ranjbar, P. A., Yeakel, H., Alnouri, G., & Sataloff, R. T. (2025). The association of objective voice measurement with treatment of vocal fold masses. Journal of Voice, 39(2), 505–511. 10.1016/j.jvoice.2022.09.028
  69. Zacharias, S. R. C., Myer, C. M., IV, Meinzen-Derr, J., Kelchner, L., Deliyski, D. D., & de Alarcon, A. (2016). Comparison of videostroboscopy and high-speed videoendoscopy in evaluation of supraglottic phonation. Annals of Otology, Rhinology & Laryngology, 125(10), 829–837. 10.1177/0003489416656205


Data Availability Statement

The published data are available from the corresponding author upon reasonable request.


Articles from Journal of Speech, Language, and Hearing Research : JSLHR are provided here courtesy of American Speech-Language-Hearing Association