Abstract
Objective:
Perceived voice quality (VQ) ratings provide ordinal data on arbitrary scales regarding the relative order of VQ disruptions ranging from mild to severe. The overarching goal of this research is to develop a new and rigorous method for the perceptual evaluation of VQ that supports quantitative comparisons and facilitates standardized comparisons across time points, clinicians, and clinical sites. Our prior research developed standard ratio-level scales of breathy and rough VQ analogous to the sone scale that (a) have physical units, (b) are strongly related to psychophysical measures, and (c) can quantify not just the direction but also the magnitude of change. The current study reestablishes the standard reference points, validates the newly developed VQ scales with natural dysphonic stimuli, and evaluates the psychometric measurement properties of the novel scales for breathiness and roughness in a small clinical pilot experiment.
Method:
In the first experiment, a set of magnitude estimation (ME) tasks was used to determine the perceived magnitudes of breathiness and roughness of 10 natural voice stimuli in each VQ continuum. The resulting data were compared with previously acquired magnitude estimates of the synthetic comparison stimuli to adjust the reference points. In the second experiment, a set of inexperienced listeners evaluated the same set of natural dysphonic stimuli using the new clinical scales, with perceived magnitude expressed in standard VQ units. In the third experiment, two expert clinicians evaluated 12 breathy and 10 rough dysphonic voices (pre- and posttreatment) using the new VQ scales.
Results:
The standard reference units were identified as 15 dB SNR for breathiness and −26 dB modulation depth for roughness. The strength of the relationship between the ME data and the predicted values from the clinical scales was high for both breathiness and roughness (r > .9). Treatment outcomes measured using the newly developed scales demonstrated high intrarater (r > .8) and interrater (r > .8) reliability when compared to the Consensus Auditory-Perceptual Evaluation of Voice, providing evidence for the concurrent validity of the clinical scales.
Conclusion:
Such formal VQ scales support valid quantitative comparisons of perceptual judgments and represent a critical step in clinical translation.
Change in voice quality (VQ) is a hallmark of voice disorders and can be an indicator of disorder severity, progression, and treatment outcome. For this reason, accurate characterization of VQ is a crucial component of clinical voice assessment. As such, comprehensive knowledge and understanding of how listeners perceive VQ is integral to the advancement of voice research and clinical practice. Prior to 2009, perceptual evaluation of VQ was completed using a wide range of tools across clinics including GRBAS (grade, roughness, breathiness, asthenia, strain; Hirano, 1981). The development of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V; Kempster et al., 2009) represented a clinical breakthrough by providing a formal and structured evaluation tool that included live sampling of multiple utterance types and simultaneous, multidimensional perceptual VQ assessment. Its widespread use (~2010 to today; Kempster et al., 2025) has elevated the rigor of perceptual VQ evaluation.
However, major shortcomings of the instrument include limited rater agreement and reliability and an inability to validly quantify the magnitude of change. For example, research by Zraick et al. (2011) reported intrarater reliability ranging from 0.35 (strain) to 0.82 (breathiness) and interrater reliability ranging from 0.28 (pitch) to 0.76 (overall severity). In contrast, studies that report good to strong reliability data for CAPE-V ratings are often dependent upon the degree of training, use of anchors, and/or the voice characteristic to be rated (e.g., Karnell et al., 2007; Nemr et al., 2012). By its very nature, a visual analog scale (VAS; e.g., CAPE-V) provides ordinal-like data based on arbitrary numbers (Nagle, 2016, 2025; Patel et al., 2010). Instead, ratio-level data are needed to quantify magnitude of change (e.g., pre- vs. posttreatment). In addition, CAPE-V scores have no relationship to physical units or psychophysical measures. Physical units are important because they provide objective, standardized, and quantifiable ways to measure perception that allow accurate and precise comparisons across clinicians, time points, and settings. Ideal perceptual outcome measures must be valid, reliable, sensitive, and specific to the presence or absence of a disorder. In the context of VQ, it is also essential that such measures are sensitive and specific to VQ dimension. This essential information allows the voice clinician to adhere to the most accurate diagnostic and treatment decision-making processes (Desjardins et al., 2017).
Our psychoacoustic approach uses general experimental methods common to all branches of psychophysics (i.e., vision, hearing, touch, taste, and smell), including matching (MA), magnitude estimation (ME), and scaling. The process of developing a standard scale was first defined by the famous sensory scientist Stevens (1936) in the context of the loudness of sounds. Sound is quantified in units of decibel sound pressure level, which have desirable ratio-level mathematical properties needed to compute differences and change; however, the dB metric is cumbersome and lacks intuitiveness for use outside of scientific and engineering contexts. Therefore, Stevens developed a standard perceptual scale for loudness called the sone scale, and he advanced the sone as the standard unit of loudness. These efforts toward standardization were well received, and the sone scale is used widely to this day in scientific, clinical, and industry applications. Since then, such scaling has been applied to many domains.
As illustrated in Figure 1 for rough VQ, there are four essential steps of scale unit development (Eddins et al., 2021) involving two arduous laboratory tasks to develop a relationship between physical units and perception, and two mathematical steps to transform that relationship to a simple standard scale. Step 1 (physical scales; lower x-axis in Figure 1) involves establishing the relationship between the percept and physical units. To capture perception, an MA task is ideal, as it (a) minimizes contextual biases, (b) provides specific physical units derived from the comparison sound that are associated with perceived quality through the MA process, and (c) has ratio-level measurement properties that support expression of mathematical relationships in perceptual data. Here, the task involves matching the quality of a dysphonic voice to a synthetic comparison sound that is variable along each of the primary VQ dimensions. Matching to VQ results in a wide range of physical units of independent variables responsible for VQ perception, spanning normal to severely dysphonic (Park et al., 2022, 2023; Patel et al., 2012a). Step 2 (psychophysical scales; y-axis in Figure 1) involves establishing the relationship between the physical units derived from the MA task and perceived VQ magnitude using an ME task to evaluate the individual VQ dimensions. The ME task is also cumbersome to use, and to improve upon this, in Steps 3 and 4, we map the physical continua to standard scales (analogous to the development of the sone scale for loudness; American National Standards Institute S3.4, 2004), rendering perceptual VQ measurements operable in practical (clinical, scientific, professional) contexts.
Figure 1.
The transition from physical scales (lower x-axis) to standard scales (upper x-axis) for roughness is shown. The steps for developing the scales are described as text within the figure. The values corresponding to one scale unit were determined by identifying the primary inflection point in the fitted function and rounding it to the nearest discrete independent variable value used in the relevant magnitude estimation experiments.
In Steps 1 and 2, the MA and ME tasks involve perceptual judgments of a set of stimuli, with multiple repetitions and judgments for each stimulus, across multiple participants. In Step 3, the continuum of magnitude estimates from the ME task is displayed as a function of physical units and a single reference point is chosen as a standard reference for the subsequent scale. Since the direct relationship between those physical units and perception of VQ along a single dimension is known from the MA task of Step 1, the standard reference ties perceptual magnitude to physical magnitude. Step 4 involves rescaling all magnitude estimates in the continuum relative to the standard reference unit chosen in Step 3 (upper x-axis in Figure 1). To use the newly developed scale, the perceived VQ of a voice sample is judged relative to the reference value of 1 scale unit, either being some fraction or some multiple of that standard unit. Because the units now reflect a ratio-level scale, mathematical computations on the judgments can be made, such as differences to determine magnitude of clinical change.
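To make Step 4 concrete, the rescaling of magnitude estimates into standard scale units can be sketched in a few lines of Python. This is an illustrative sketch only (not the authors' implementation), and the raw ME values and the reference magnitude below are hypothetical:

```python
import numpy as np

def to_scale_units(me_values, me_at_reference):
    """Express magnitude estimates as multiples of the standard reference
    (1 scale unit), preserving the ratio-level relationships of the ME data."""
    return np.asarray(me_values, dtype=float) / me_at_reference

# Hypothetical example: raw ME judgments for three voices, with the standard
# reference stimulus (e.g., the 15 dB SNR breathy comparison) judged at 120
# on the 1-1000 ME scale.
raw_me = [60.0, 120.0, 480.0]
units = to_scale_units(raw_me, me_at_reference=120.0)  # -> [0.5, 1.0, 4.0]
```

Because the resulting units are ratio level, differences such as pre- minus posttreatment values are meaningful magnitudes of change.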
In a previous study, Eddins et al. (2021) explored the idea of developing standard scales for breathiness and roughness using the comparison stimuli taken from the MA task as the primary stimulus. They established a reference value of 18 dB SNR for breathiness and redefined that as 1 breathiness unit. They also established an amplitude modulation depth of −27 dB for roughness and redefined that as 1 roughness unit. Through a logical extension, the current study had three main goals. Since the previous study determined the standard reference from ME data obtained solely on synthetic comparison, the first goal was to affirm/establish the standard reference points for 1 VQ unit using natural dysphonic voice samples rather than only synthetic comparison sounds used by Eddins et al. (2021). In doing so, this allows a test of the construct validity of the prior research and holds the potential to increase the face validity of the scale units. The second goal was to test the external validity of the ratio-level dimension-specific VQ scales by applying the scale to assess a unique set of natural voice samples. The third goal was to evaluate the psychometric measurement properties of the newly developed breathiness and roughness scales in a small clinical pilot experiment. Successful achievement of these goals represents a milestone in the development of a new and more rigorous method for use in the perceptual evaluation of voice that has the potential to increase accuracy, precision, and application across time, sites, and clinician judges.
Experiment 1: Establishing Standard Reference Points
The first goal of the current study was to determine whether the standard reference points of 18 dB SNR (for breathy VQ) and −27 dB amplitude modulation depth (for rough VQ) that were previously established using synthetic comparison stimuli would still hold if the reference points were established on the basis of judgments using natural dysphonic stimuli. Experiment 1 addressed this goal, and its results informed the construct validity of VQ scaling across the wide severity continuum of dysphonic voices.
Experiment 1 Method
Experiment 1: Stimuli
A total of 44 breathy /a/ phonations from talkers with dysphonia1 were selected from two disordered voice databases. As used in prior perceptual studies (Anand, 2023; Eddins et al., 2016), 14 voices were selected from the University of Florida Dysphonic Voice Database (Anand et al., 2019) and 30 voices were selected from the Kay Elemetrics Disordered Voice Database (Kay Elemetrics Corporation, 1994). A stratified-random sampling procedure ensured that voices were primarily “breathy” with minimal occurrence of other qualities such as strain and represented a wide continuum of severity. The sustained vowels were cropped to 500-ms duration to obtain steady-state phonation. In addition to the natural dysphonic voices, 13 synthetic breathy comparison sounds were also chosen for the perceptual ME task. These 13 synthetic comparison sounds ranged in signal-to-noise ratio from 0 dB (most breathy) to 36 dB (least breathy) in 3-dB steps. This range was identified from prior breathiness MA experiments (Eddins et al., 2021; Patel et al., 2012a) to capture the continuum of breathiness severities in natural dysphonic voices.
Similarly, a total of 44 rough /a/ phonations from another set of talkers with dysphonia1 were selected from two disordered voice databases. As used in prior perceptual studies (Anand, 2023; Park et al., 2022), 10 voices were selected from the University of Florida Dysphonic Voice Database and 34 voices were selected from the Sataloff/Heman Ackah database (Heman-Ackah et al., 2002), also using a stratified-sampling procedure such that voices were primarily “rough” with minimal occurrence of other qualities such as strain and represented a wide continuum of severity. The sustained vowels were cropped to 500-ms duration to obtain steady-state phonation. In addition to the natural dysphonic voices, 13 synthetic comparison sounds were also chosen for the perceptual ME task. These 13 synthetic rough comparison sounds were amplitude modulated using a sinusoidal function of the fourth power with a 25-Hz modulation frequency (fAM), as shown in Equation 1.
y(t) = x(t) · [1 + m · sin⁴(2π f_AM t)]   (1)
The modulation depth (m ∈ [0, 1]) in Equation 1 was expressed in dB, as shown below, and ranged from −12 dB (most rough) to −36 dB (least rough) in 2-dB steps.
m_dB = 20 · log₁₀(m)   (2)
Here, the modulation depth m ranges from 0 to 1 and, therefore, m_dB ranges from 0 dB down to −∞. The range from −12 to −36 dB was identified from prior roughness MA experiments (Patel et al., 2012b) to capture the continuum of roughness severities in natural dysphonic voices.
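The modulation defined in Equations 1 and 2 can be sketched as follows. This Python fragment is illustrative only, not the authors' synthesis code: the carrier signal stands in for the synthetic comparison sound, whose exact spectral content is not specified here.

```python
import numpy as np

def apply_rough_modulation(carrier, mod_depth_db, f_am=25.0, fs=44100):
    """Apply the sin^4 amplitude modulation of Equations 1-2 to a carrier.

    mod_depth_db = 20*log10(m), so 0 dB -> m = 1 (deepest modulation)
    and -36 dB -> m ~= 0.016 (shallowest).
    """
    m = 10.0 ** (mod_depth_db / 20.0)               # invert Equation 2
    t = np.arange(len(carrier)) / fs
    envelope = 1.0 + m * np.sin(2 * np.pi * f_am * t) ** 4
    return carrier * envelope

# 500-ms placeholder carrier (a 100-Hz sinusoid, chosen arbitrarily)
# modulated at -12 dB depth, the "most rough" end of the continuum.
fs = 44100
t = np.arange(int(0.5 * fs)) / fs
carrier = np.sin(2 * np.pi * 100 * t)
modulated = apply_rough_modulation(carrier, mod_depth_db=-12.0, fs=fs)
```

Stepping mod_depth_db from −12 to −36 dB in 2-dB steps would reproduce the 13-point comparison continuum described above.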
Experiment 1: Listeners
Ten young adult listeners (two male and eight female college students) aged 20–37 years (M ± SD = 25 ± 5 years) were recruited from the University of South Florida to participate in the breathy ME task. Similarly, 10 young adult listeners (one male and nine female students) aged 20–37 years (M ± SD = 26 ± 5 years) completed the rough ME task; seven of these had participated in the breathy ME task and three were new listeners. All listeners were self-reported native speakers of American English and had hearing thresholds less than 20 dB HL via air conduction at frequencies of 250, 500, 1000, 2000, and 4000 Hz. As part of routine lab intake, the following procedures were conducted but were not used as covariates in subsequent analyses: otoscopy, tympanometry, hearing health history, noise exposure, cognitive status, and history of neurological disorders or head trauma. Listeners had no or limited background in communication sciences and disorders and voice evaluation. Each listener consented to participation in accordance with the procedures approved by the University of South Florida institutional review board (IRB Pro0012381) and was compensated for their time.
Experiment 1: Instrumentation
Stimulus presentation and response collection were controlled using the TDT SykofizX software and TDT System 3 RZ6 digital signal processing hardware with a D/A module (Tucker-Davis Technologies, Inc.). Listeners were seated in a sound-treated booth, and stimuli were presented monaurally using ER-2 insert earphones (Etymotic Research Inc.) at 85 dB SPL.
Experiment 1: Procedures
For both VQ dimensions, a loudness ME task was completed first to familiarize listeners with the concept of ME. Briefly, on each trial, listeners assigned a numerical value between 1 and 1,000 to reflect the loudness of a 500-ms, 1000-Hz pure tone. The presentation level varied from 60 to 92 dB SPL. A detailed description of the loudness ME task was reported in Anand (2023). An ME task for the VQ dimensions required listeners to assign a numerical value between 1 and 1,000 to reflect the perceived magnitude of the respective VQ dimensions. Following the loudness ME task, listeners completed a practice condition for perceptual ME judgments on breathiness, with one presentation of the 57 stimuli (44 natural and 13 comparison stimuli). This was followed by the experiment condition for perceptual ME judgments on breathiness, with 10 repetitions of the 57 stimuli in one block (570 stimuli). Participants were provided with feedback on the use of the scale and their intrarater reliability (computed as intraclass correlation coefficients) on the first block. Following a brief rest period, perceptual data were processed using a custom MATLAB algorithm to (a) confirm that participants utilized the full extent of the 1 to 1,000 scale to represent perceived magnitude, (b) validate that magnitude estimates reflected proportional relationships, and (c) verify high intrarater reliability across 10 stimulus repetitions (r > .80). Subsequently, participants were presented with graphical feedback illustrating individual repetitions, average perceptual ratings, and corresponding reliability. If any of the outlined criteria were unmet, supplementary instructions were provided. Two additional experimental blocks (570 stimuli each) were completed, and the data averaged across the three experimental blocks (570 × 3 = 1,710 stimuli) were used for further analyses.
For perceptual ME judgments on roughness, listeners were familiarized with synthetic stimuli with modulation depths of 0, −5, −10, or −15 dB, ranging from the deepest to the shallowest modulation. Practice involved estimating the perceived magnitude associated with those amplitude modulation depths. The first stimulus in the practice set was either 0, −5, or −10 dB. Similar to the breathiness ME task, one presentation of the set of 57 rough stimuli was completed as practice, followed by 10 repetitions of the 57 stimuli per block of trials (570 stimuli) for three experimental blocks. The final data set for each listener represented the average over 30 repetitions for each of the 57 stimuli.
Experiment 1 Results
For each VQ, intraclass correlation coefficients [ICC (2, k)] were used to calculate both intra- and inter-listener reliability for the perceptual data (Shrout & Fleiss, 1979). Across all stimuli, the mean intra-listener reliability, ICC (2, k), k = 30 repetitions, was 0.97, and inter-listener reliability, ICC (2, k), k = 10 listeners, was 0.92 for perceived breathiness. Similarly, the mean intra-listener reliability was 0.96, and inter-listener reliability was 0.97 for perceived roughness. Table 1 depicts these values for individual stimulus groups for each of the VQs.
Table 1.
Listener reliability for perceived breathy and rough magnitude estimates.
| Stimuli/VQ | Breathiness: Intra-listener | Breathiness: Inter-listener | Roughness: Intra-listener | Roughness: Inter-listener |
|---|---|---|---|---|
| Synthetic comparisons | .92 | .87 | .96 | .96 |
| UFDVD | .97 | .94 | .95 | .95 |
| KayPENTAX/Sataloff | .96 | .91 | .96 | .98 |
Note. Measured using intraclass correlation coefficients (2, k) where k = 30 repetitions for intra-listener reliability and k = 10 listeners for inter-listener reliability. VQ = voice quality; UFDVD = University of Florida Dysphonic Voice Database.
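For readers wishing to reproduce the reliability analysis, ICC(2, k) can be computed directly from a two-way ANOVA decomposition (Shrout & Fleiss, 1979). The following Python sketch is illustrative rather than the authors' code, and the ratings matrix is a toy example:

```python
import numpy as np

def icc_2k(x):
    """ICC(2, k): average measures, two-way random effects, absolute agreement.
    x is an n_targets x k_raters matrix of ratings (raters may instead be
    repetitions when computing intra-listener reliability)."""
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # targets (stimuli)
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # raters (or reps)
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)                # between-targets mean square
    jms = ss_cols / (k - 1)                # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)

# Toy example: four stimuli rated identically by three listeners -> ICC = 1.
ratings = np.tile(np.array([[1.0], [2.0], [3.0], [4.0]]), (1, 3))
icc = icc_2k(ratings)   # -> 1.0
```

A constant offset between raters lowers ICC(2, k) only slightly, because the average-measures form discounts consistent rater bias spread over k raters.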
After the reliability analysis of the VQ percepts, the raw data from the seven listeners who overlapped across the breathiness and roughness judgments of the 13 synthetic comparison sounds for each VQ continuum were fitted with the logistic function from Eddins et al. (2021), as shown in Figure 2. The top panel shows breathiness magnitude as a function of SNR, averaged across the seven listeners. The bottom panel shows roughness magnitude as a function of amplitude modulation depth, averaged across the seven listeners. Error bars indicate the standard error of the mean. The logistic functions provided an excellent fit (r2 = .99) for both the perceived breathiness and roughness data over the range of dysphonic voices. Following Step 3 in the scale development process, the major inflection point of the fitted function (Equation 3) was chosen as the single reference value, shown by the star marker in each panel of Figure 2. In Equation 3, x is the value of the independent variable (dB SNR or dB modulation depth), and a0, a2, and a3 are the coefficients of the logistic equation.
ME(x) = a0 / (1 + e^(a2 · (x − a3)))   (3)
Figure 2.
Perceived breathiness magnitude as a function of stimulus signal-to-noise ratio (SNR; top panel) and perceived roughness magnitude as a function of stimulus modulation depth (dB; bottom panel). The solid curve represents the mean across seven listeners, error bar indicates standard error of the mean, and the star marker indicates the inflection point or one standard reference unit.
In the presence of natural dysphonic stimuli, the breathy VQ reference value redefined as one unit corresponded to 15 dB SNR, which is close to but slightly lower than the 18 dB SNR established by Eddins et al. (2021) using synthetic stimuli. In a similar manner, the roughness VQ reference value redefined as one unit was equal to −26 dB amplitude modulation depth. Again, this was very similar to the value of −27 dB amplitude modulation depth established by Eddins et al. (2021) using synthetic stimuli. Note that, for roughness, the comparison step size was modified from 3 dB (Eddins et al., 2021) to 2 dB (current study) to allow for the detection of finer variations in perceived roughness; hence, the reference value can be considered almost identical.
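The reference-point selection in Step 3 can be sketched in Python. This is an illustrative sketch only: it assumes a logistic parameterization of the general form ME(x) = a0 / (1 + exp(a2·(x − a3))), whose inflection point falls at x = a3, and the coefficient values below are hypothetical placeholders, not the published fits.

```python
import numpy as np

def logistic(x, a0, a2, a3):
    """Logistic ME-versus-physical-units function; inflection at x = a3."""
    return a0 / (1.0 + np.exp(a2 * (x - a3)))

def reference_point(a3, step, lo, hi):
    """Round the inflection point to the nearest discrete comparison value
    on the synthesis continuum (per the Figure 1 caption)."""
    grid = np.arange(lo, hi + step, step)
    return grid[np.argmin(np.abs(grid - a3))]

# Hypothetical roughness fit: modulation depth from -36 to -12 dB in 2-dB
# steps, with an (assumed) inflection near -25.7 dB.
ref = reference_point(a3=-25.7, step=2, lo=-36, hi=-12)  # -> -26
```

Rounding to the stimulus grid is what makes the standard reference unit correspond to an actually synthesizable comparison sound.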
Experiment 1 Discussion
In this experiment, listeners were presented with a total of 1,710 stimuli (57 stimuli [44 natural and 13 synthetic comparison] × 10 repetitions × 3 blocks) per VQ dimension. Averaging data from a larger number of repeated ratings provided a more stable representation of the VQ and was therefore used for the development of the psychophysical scale (Eddins et al., 2021; Shrivastav et al., 2005). Listeners judged the VQ percepts of breathiness and roughness on a ratio scale through the ME task. Inexperienced listeners performed the task with high intra- and inter-listener reliability (as shown in Table 1). In the current study, the breathiness magnitudes plateaued for SNR values of 20 dB and greater, as illustrated by the symbols and the fitted function (see Figure 2, top panel). However, in the previous study by Eddins et al. (2021), the breathiness magnitudes plateaued for SNR values of 27 dB and greater. This difference may be attributed to differences in the stimulus sets used in the two studies. Eddins et al. (2021) evaluated breathy magnitude only for synthetic comparison sounds, whereas the current study combined synthetic comparisons with natural stimuli. It is possible that perceived changes in breathiness across SNR levels were accentuated when synthetic comparisons were presented along with natural stimuli. This resulted in a minor shift in the standard reference value in the presence of natural dysphonic stimuli and a reduction in the severity range. For roughness (see Figure 2, bottom panel), the standard reference value of −26 dB modulation depth was very similar to the standard reference point of −27 dB established by Eddins et al. (2021), perhaps indicating that the synthetic comparisons integrate more tightly with natural dysphonic voices. This method of VQ scale development used ratio ME as the perceptual task, motivated by the sone scale developed for loudness.
It was also motivated by the underlying theory that VQ perception is a psychophysical phenomenon like loudness. However, the use of different ME rating paradigms may lead to different mathematical functions and consequently different “units” (Marks & Florentine, 2010).
Experiment 2: Development and Validation of One-Dimensional Clinical VQ Scales
The second goal of the current study was to validate the ratio-level dimension-specific VQ scales with natural dysphonic stimuli. Experiment 2 addressed this goal through (a) the development of prototype clinical software that incorporated the VQ scales with standard units and (b) validation of the VQ scales in a laboratory environment. Results of Experiment 2 allowed for translation of this tool into a pilot field study.
Experiment 2 Method
Experiment 2: Stimuli
The same 44 breathy and 43 rough natural voice stimuli used in Experiment 1 were also selected for Experiment 2. However, the duration of the stimulus was modified to a stable 1000 ms to better represent and capture real-world clinical assessment practice. Among these 44 breathy stimuli, four were used for training and the remaining 40 were used for testing. For roughness, four were used for training and 39 were used for testing.
Experiment 2: Listeners
Ten new young adult females aged 20–22 years (M ± SD = 20.8 ± 0.9 years) were recruited from the University of South Florida to participate in this experiment. Similar to Experiment 1, all listeners were self-reported native speakers of American English with normal hearing and had no or limited background in communication sciences and disorders and voice evaluation. They were consented in accordance with the procedures approved by the University of South Florida institutional review board (IRB Pro0012381) and compensated for their participation.
Experiment 2: Software Development
A custom-designed MATLAB graphical user interface, “QualEVox,” shown in Figure 3, was developed. The QualEVox software allowed listeners to judge the breathiness/roughness of a natural dysphonic stimulus in ratio-level units. The VQ scale was represented as a vertical line ranging from most breathy at the top to least breathy at the bottom. The values on the right side of the vertical line depicted breathiness scale units. For example, a severely breathy voice would be evaluated closer to eight units. The buttons on the left of the vertical line represented synthetic anchors. The use of synthetic anchor comparisons in this study constrains the percept to a single change per perceptual continuum, which limits variability associated with multiple simultaneous acoustic changes. These anchors were derived from decades of research grounded in psychophysical theories and methods that established the fundamental relationship between physical units and VQ perception, which is unique to each VQ. The synthetic continua captured and exceeded the range of physical units corresponding to dysphonic VQ across a broad spectrum of severity, from mild to severe. The number of anchors was established on the basis of the underlying psychophysical function, which captured the range of units and approximated the number of distinct values within that range. Here, one unit of breathiness was an anchor sound with a reference value of 15 dB SNR (based on the results of Experiment 1), and one unit of roughness was an anchor sound with a reference value of −26 dB amplitude modulation depth (also based on the results of Experiment 1). Anchor sounds for other units were based on Step 4 (upper x-axis) of Eddins et al. (2021).
Figure 3.
Sample evaluation of perceived breathiness through a one-dimensional (1D) breathiness scale using the QualEVox software graphical user interface.
Experiment 2: Instrumentation
Stimulus presentation and response collection were controlled using the QualEVox software. Listeners were seated in a sound-treated booth, and stimuli were presented using circumaural Sennheiser HD 280 Pro headphones at 75 dB SPL. Headphones were calibrated by entering the dB/Volt specifications provided by the headphone manufacturer.
Experiment 2: Procedures
All listeners began their listening session by first familiarizing themselves with the VQ percepts of breathiness, roughness, and strain using definitions from Kempster et al. (2009) and descriptions with example sounds provided as a PowerPoint presentation. For breathiness and roughness, steady 1000-ms vowel /a/ phonation samples varying across three severity levels (mild, moderate, and severe) were used. For strain, mild and severe levels were used; the mild sample varied only in perceived strain, and the severe sample had covarying breathiness. Listeners were instructed to focus on the individual breathy or rough VQs and to ignore other stimulus characteristics, including alternative qualities, pitch, or loudness. They then completed a practice trial with four breathy/rough stimuli to become familiar with the QualEVox scaling task and the software interface. Prior to evaluating individual voice samples, listeners were instructed to click on each of the anchor sounds and listen to them in descending (most to least breathy) and then ascending (least to most breathy) order to familiarize themselves with the range of physical units. The QualEVox evaluation provided a single VQ-specific value for each dysphonic voice sample. Listeners heard the dysphonic stimulus under study with every movement of the vertical slider button. Listeners were asked to move the slider to the position that best represented the breathiness of the dysphonic voice using the synthetic anchors as a reference. Listeners had the ability to listen to any of the anchors and the dysphonic voice as many times as needed before confirming their judgment and moving on to the next voice. Listeners were able to see the number of samples to be rated and monitor their progress at the top (i.e., Sample 1 of 3). For the test, each of the 40 breathy or 39 rough stimuli was presented 10 times to obtain intra-listener reliability.
The same set of 10 listeners completed breathy and rough VQ judgments over two different sessions, each lasting a maximum of 2 hr, completed within 1 week. The QualEVox software exported the data into an Excel sheet for further analyses.
Experiment 2 Results
Similar to Experiment 1, intraclass correlation coefficients [ICC (2, k)] were employed to assess both intra- and inter-listener reliability for the perceptual data corresponding to each VQ. Across all test stimuli, the mean intra-listener reliability, ICC (2, k), k = 10 repetitions, was 0.97, and inter-listener reliability, ICC (2, k), k = 10 listeners, was 0.96 for perceived breathiness. Similarly, the mean intra-listener reliability was 0.97, and inter-listener reliability was 0.98 for perceived roughness. One breathy test stimulus was excluded (leaving N = 39) because the averaged listener data yielded an implausible scale value (for a voice with severe breathiness and ME estimates close to 1,000, the scale unit provided was 1 rather than a value closer to 8). A set of simple linear regressions was then used to analyze the relationship between VQ scale units from QualEVox and the perceived magnitude estimates from Experiment 1. For both breathiness and roughness, there was a strong and significant correlation (breathiness: Pearson's r = .93, r2 = .87, p < .001; roughness: Pearson's r = .94, r2 = .88, p < .001) between magnitude estimates obtained from Experiment 1 (y-axes in Figure 4) and VQ scale units obtained from the QualEVox software in Experiment 2 (x-axes in Figure 4). This relationship held for the mean of 10 repetitions (typical in laboratory settings) as well as for the first judgment extracted from the 10 repetitions of the QualEVox scale perceptual data (typical in clinical settings).
Figure 4.
A scatterplot and linear fit of mean perceived breathiness (top panel) and mean perceived roughness (bottom panel). x-axes denote perceptual data from QualEVox (Experiment 2), and y-axes denote perceptual data from the magnitude estimation (ME) task (Experiment 1).
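The validity analysis above (a simple linear regression of ME magnitudes on QualEVox scale units) can be sketched as follows. This Python fragment is illustrative only; the scale units and ME values are hypothetical toy data, not the study's measurements:

```python
import numpy as np

# Hypothetical paired observations: mean QualEVox scale units (x) and mean
# magnitude estimates from the ME task (y) for five voices.
scale_units = np.array([0.5, 1.0, 2.0, 4.0, 6.0])
mean_me = np.array([70.0, 130.0, 260.0, 520.0, 700.0])

slope, intercept = np.polyfit(scale_units, mean_me, 1)   # simple linear fit
r = np.corrcoef(scale_units, mean_me)[0, 1]              # Pearson's r
r_squared = r ** 2                                       # variance explained
```

A strong positive r between the two independent perceptual measures, as reported above, is the evidence of concurrent validity.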
Experiment 2 Discussion
Results of Experiment 2 demonstrated that inexperienced listeners can perform reliable scaling of breathy and rough VQ using the newly developed VQ scales. High correlations (with the same stimuli but different listeners) between magnitude estimates and the VQ scale units supported the validity of the VQ scale units as a reliable metric for perceptual VQ measurement. Indeed, even with one repetition of the stimulus, the high correlations between the two perceptual methods remained, supporting the robust nature of the newly developed VQ scaling procedure. This result is in agreement with data from a recent study by Nagle et al. (2024), which surveyed voice-focused speech-language pathologists on their use of the CAPE-V. Fifty-two of the 54 responding clinicians reported rating VQ either during or immediately following the evaluation session. However, responses regarding the use of recordings indicated variability in practice. Although 25% of respondents reported always or usually relying on recordings to inform their ratings, a larger proportion (45%) stated that they rarely or never use recordings. Among those who do use recordings (n = 34), 43% reported listening only once before making a rating decision. The clinical prototype software (QualEVox) developed in this experiment produced reliable and repeatable results across 10 listeners. The current study focused on samples that were predominantly breathy and rough. Given that VQ dimensions often co-occur and covary in the majority of dysphonic voices, a logical next step would be to develop three-dimensional matching (3DMA) and magnitude estimation (3DME) tasks to capture this covariance. A new QualEVox3D scale is being developed to allow clinicians to rate the three primary VQ dimensions.
Experiment 3: Evaluation of Psychometric Properties of One-Dimensional Clinical VQ Scales—A Pilot Field Study
The third goal of the current study was to examine the psychometric measurement properties of the newly developed breathiness and roughness scales in a small clinical pilot experiment. As in many other areas of evidence-based practice, ideal perceptual measures of VQ need to be reliable and valid and must possess the mathematical properties needed to support measurement of the magnitude of change, that is, to accurately capture responsiveness to change due to disease progression or treatment (e.g., Carding et al., 2009). Experiment 3 was designed to evaluate these properties as well as to gather clinician feedback about the usability of the QualEVox software.
Experiment 3: Stimuli
The first author (S.A.) listened to de-identified samples from the Medical University of South Carolina (MUSC) clinical database (breathiness: 24 pre- and 16 posttreatment samples; roughness: 23 pre- and 12 posttreatment samples) and completed quality assurance checks (e.g., identification of peak clipping, tremor, presence of other VQs). Samples with peak clipping, tremor or frequency shifts, or co-occurring strain were excluded. The author rated the severity of the samples as near normal, mild, moderate, or severe to ensure a broad continuum of severities commonly seen in clinical practice as well as to obtain a range of units on the newly developed dimension-specific VQ scales. Through stratified random sampling, the final set comprised 19 pre- and 12 posttreatment samples from patients with vocal fold paralysis (with the hallmark breathy vocal quality pretreatment) and 17 pre- and 10 posttreatment samples from patients with vocal fold cysts or polyps (with the hallmark rough vocal quality pretreatment). The selected audio files were then cropped to 1000-ms duration for testing.
Experiment 3: Listeners
Two expert female clinicians (29 and 36 years old) with greater than 5 years of experience in voice evaluation participated in this experiment. Both listeners were native speakers of American English and self-reported normal hearing. This study was approved by the MUSC institutional review board (Pro00089177—for acquisition of samples; Pro00030315—for perceptual experiment).
Experiment 3: Instrumentation
Stimuli were presented through the clinicians' computers. Both clinicians were provided with over-the-ear headphones (Vic Firth) and an iBoundary microphone (MicW Audio) for calibration of the headphones. The QualEVox software was updated from Experiment 2 with two specific changes: (a) The anchor buttons were systematically enabled in descending (e.g., most breathy to least breathy) and ascending (e.g., least breathy to most breathy) order so that clinicians could become familiar with the range of physical units, and (b) a more accurate method of calibration through a microphone was provided, as described below.
Experiment 3: Procedures
The calibration procedure comprised four steps. In Step 1, clinicians connected their microphone and headphones to their computers. In Step 2, a calibrator (Brüel & Kjaer Type 4230) producing 94 dB at 1000 Hz was placed on top of the microphone and recorded using a button on the QualEVox software interface. In Step 3, each channel of the headphones was placed on top of the microphone using a flat-plate coupler, and a frequency sweep (0.1–20 kHz) presented through the headphones was recorded. In Step 4, clinicians saved their calibration data for reuse in subsequent sessions.
Similar to Experiment 2, a practice trial with eight stimuli (four primarily breathy; four primarily rough) allowed the clinicians to become familiar with the QualEVox scaling task and the software interface. For the test, each of the pre- and posttreatment samples was presented only once. Breathiness evaluations were completed first, followed by roughness evaluations on a different day. In addition, to measure intrarater reliability, clinicians repeated the entire procedure 2 weeks later. Each session took a maximum of 45–60 min. The QualEVox software exported the data into an Excel sheet for further analyses. To assess the usability of both the scale and the software (including accuracy and ease of use), clinicians were asked to provide descriptive notes as well as to complete a custom usability questionnaire adapted from the 26-item scale of Ben-Zeev et al. (2014) and the well-known System Usability Scale (SUS-A; Brooke, 1995). For the descriptive note, the prompt was "Please use this space for any additional comments you may have."
The CAPE-V was scored using the conventional protocol of providing a global judgment of each VQ percept of interest after listening to both the vowels and sentences. For the purposes of the current experiment, the same two clinicians provided offline CAPE-V breathy/rough and overall severity judgments based on the 1000-ms /a/ stimuli used in this experiment.
Experiment 3 Results
Experiment 3: Reliability and agreement
Intraclass correlation coefficients [ICC(2,1); Shrout & Fleiss, 1979] were computed using a two-way mixed-effects model to assess inter- and intra-listener reliability for the perceptual data corresponding to each VQ. Across both treatment stages (pre and post) and VQs (N = 19 + 12 + 17 + 10 = 58 stimuli), mean inter-listener reliability was .82. Table 2 depicts the mean (averaged across two sessions) inter-listener reliability values for each treatment stage and VQ. For breathiness, the highest reliability (.82) was observed when both pre- and posttreatment data were included in the reliability estimation. For roughness, the highest reliability and agreement (.83) were likewise observed when both pre- and posttreatment data were included. Table 3 depicts intra-listener reliability values for perceptual data averaged across the treatment stages for each clinician. Both clinicians demonstrated high intra-listener reliability for both VQ percepts.
Table 2.
Mean (averaged across two sessions) inter-listener reliability for pre- and posttreatment for breathy and rough voice quality.
| Inter-listener | Breathiness reliability | Roughness reliability |
|---|---|---|
| Pretreatment | .80 | .81 |
| Posttreatment | .71 | .64 |
| Both Tx | .82 | .83 |
Note. Tx = treatment.
Table 3.
Intra-listener reliability averaged across the treatment (Tx) stages for each voice quality and each of the clinicians.
| Intra-listener (both Tx) | Breathiness reliability | Roughness reliability |
|---|---|---|
| Clinician 1 | .93 | .84 |
| Clinician 2 | .95 | .84 |
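For readers who wish to reproduce this type of reliability analysis outside SPSS, the ICC(2,1) of Shrout and Fleiss (1979) can be computed directly from the two-way ANOVA decomposition of a stimuli-by-raters matrix. The sketch below is a minimal pure-Python illustration using hypothetical ratings, not the study's data or software:

```python
def icc_2_1(ratings):
    """ICC(2,1), absolute agreement, per Shrout & Fleiss (1979).

    `ratings` is a list of rows: one row per stimulus (target),
    one column per rater. Returns a single float.
    """
    n = len(ratings)          # number of stimuli
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # stimuli
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # raters
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-stimuli mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Two hypothetical raters: perfect agreement vs. a constant offset.
perfect = [[1, 1], [2, 2], [3, 3], [4, 4]]
offset = [[1, 3], [2, 4], [3, 5], [4, 6]]
print(icc_2_1(perfect))  # 1.0
print(icc_2_1(offset))   # ≈ 0.455 (the offset is penalized as disagreement)
```

Because ICC(2,1) indexes absolute agreement, a systematic offset between raters lowers the coefficient even when their rank orderings are identical.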
Experiment 3: Validity
In the current experiment, concurrent validity was defined as the correspondence between VQ scale units and CAPE-V scores. The rationale for this comparison was that the CAPE-V is a well-studied instrument for VQ evaluation and the current standard of clinical practice; this parallels how the validity of the CAPE-V was established through comparison with the GRBAS (Zraick et al., 2011). Table 4 depicts Spearman's rank correlation coefficients (rs) between the VQ scale units and CAPE-V scores obtained live as well as offline, based on values averaged across the two clinicians. High (rs > .9) and significant correlations were observed between the newly developed VQ scale units and offline CAPE-V judgments. Overall, averaged across treatment stage (pre- and posttreatment) and VQs (breathiness and roughness), the correlation between VQ scale units and offline CAPE-V was the strongest (rs = .95, p < .001), followed by the correlation between VQ scale units and live CAPE-V (rs = .48, p < .001) and, finally, the correlation between the two methods of CAPE-V (rs = .47, p < .001).
Table 4.
Spearman's correlation coefficient (rs) analysis across voice quality (VQ) assessment methods.
| Tx stage | Offline CAPE-V: Pre-Tx | Offline CAPE-V: Post-Tx | Live CAPE-V: Pre-Tx | Live CAPE-V: Post-Tx |
|---|---|---|---|---|
| Breathy scale units | .97** | .92** | .62* | .30 |
| Rough scale units | .89** | .88** | .32 | .41 |
Note. CAPE-V = Consensus Auditory-Perceptual Evaluation of Voice. Tx = treatment.
*p < .05. **p < .001.
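The Spearman's rs values in Table 4 are simply Pearson correlations computed on rank-transformed scores (with tied values receiving their average rank). A minimal pure-Python sketch, shown here on hypothetical ratings rather than the study's data:

```python
def average_ranks(xs):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scale units vs. CAPE-V scores for five voices:
units = [0.5, 1.2, 2.0, 4.1, 7.9]
capev = [10, 22, 35, 60, 88]
print(spearman_rs(units, capev))  # 1.0 (perfectly monotonic relation)
```

Because only rank order matters, rs is well suited to comparing a ratio-level scale against an ordinal-like instrument such as the CAPE-V.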
Experiment 3: Responsiveness to change
For breathiness, there were 12 samples with both pre- and posttreatment ratings, and, for roughness, there were 10 such samples. The ratings of these matched samples were entered into paired-samples t tests in SPSS (Version 29). When interpreting Cohen's d effect size from a paired t test, a value of 0.2 indicates a small effect, 0.5 a medium effect, and 0.8 a large effect. There was a significant difference, t(11) = 4.952, p < .001, in breathiness scale units between pretreatment (M = 4.79, SD = 2.23) and posttreatment (M = 1.86, SD = 0.95), and the effect size (Cohen's d = 2.05) indicated a large effect. Similarly, there was a significant difference, t(9) = 3.734, p = .005, in roughness scale units between pretreatment (M = 2.56, SD = 1.46) and posttreatment (M = 0.70, SD = 0.61), with a large effect size (Cohen's d = 1.58). Figure 5 depicts the mean ratings (averaged across the two clinicians) of each stimulus as well as the mean across stimuli for breathiness (top panel) and roughness (bottom panel). For the offline CAPE-V task, the mean breathiness score changed from 65 (SD = 22.8) pretreatment to 29 (SD = 16.5) posttreatment. Similarly, the mean roughness score changed from 33 (SD = 17.9) to 13 (SD = 14.4). Although the reduction in CAPE-V scores indicates the direction of change and improvement following treatment, inferring the precise magnitude of this positive change is questionable and, at best, a general approximation, because the CAPE-V scale is ordinal-like and lacks the mathematical properties needed to accurately measure responsiveness to change.
Figure 5.

Line graph representing breathiness scale units (top panel) and roughness scale units (bottom panel) as a function of pre- and posttreatment. Individual data points (averaged across the two clinicians) for each of the 12 stimuli are shown along with mean across all the stimuli.
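The paired-samples analysis above can be reproduced outside SPSS from the within-pair differences: the t statistic is the mean difference divided by its standard error, and one common convention for Cohen's d in paired designs divides the mean difference by the standard deviation of the differences (other standardizers, e.g., a pooled pre/post SD, give different d values). A minimal sketch on hypothetical pre/post ratings, not the study's data:

```python
def paired_t_and_d(pre, post):
    """Paired t statistic and Cohen's d for matched pre/post ratings.

    d is computed here as the mean difference divided by the (sample)
    SD of the differences; this is one common convention, and other
    standardizers yield different d values.
    """
    n = len(pre)
    diffs = [a - b for a, b in zip(pre, post)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    sd_d = var_d ** 0.5
    t = mean_d / (sd_d / n ** 0.5)   # df = n - 1
    d = mean_d / sd_d
    return t, d


# Hypothetical breathiness scale units for four voices, pre/post:
pre = [5.0, 6.0, 7.0, 8.0]
post = [3.0, 3.0, 3.0, 5.0]
t, d = paired_t_and_d(pre, post)
print(round(t, 3), round(d, 3))  # 7.348 3.674
```

The sketch reports only t and d; obtaining a p value additionally requires a t-distribution CDF (e.g., from a statistics package).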
Experiment 3: Usability of QualEVox
The QualEVox usability assessment included a survey that evaluated whether the clinical tool was user-friendly, using 5-point Likert scales. Results from the usability questionnaire indicated that the scales were highly efficient and easy to learn and use, though with some suggested improvements to the perceptual anchors and software interface (see the Appendix).
Experiment 3 Discussion
Standard VQ scales serve as critical benchmarks for tracking disease progression and assessing treatment outcomes. Experiment 3 explored three key measurement properties essential for any outcome tool: reliability, validity, and responsiveness to change. Additionally, to be practical, such tools must be quick and easy to administer. The feasibility, acceptability, and effectiveness of QualEVox for assessing breathy and rough VQ were evaluated by two clinicians prior to a larger clinical trial. The primary objective was to assess potential benefits and identify challenges with the proposed VQ scaling method in a real-world setting, allowing for refinement of the tool (e.g., Batty et al., 2002).
Perceptual judgments of breathiness and roughness using QualEVox were highly reliable within (r ≥ .8) and between (r ≥ .8) the two clinicians. Past research has reported that experienced listeners may introduce greater variability in VQ judgments due to their flexible strategies in VQ perception (Kreiman & Gerratt, 2000). However, our experienced clinicians demonstrated high reliability. This could be because the clinicians work in the same clinical setting and share cohesive training and backgrounds, as well as because of the psychophysical scaling method itself. The robustness of the QualEVox scaling procedure with different clinicians in a larger study is a future direction. The high correlations observed between VQ scale units and offline CAPE-V judgments (both based on the same recorded 1000-ms /a/ stimuli) provide validation for the newly developed VQ scales. The low correlations between live CAPE-V and the scale units are not surprising, because, for all of the corresponding cells in Table 4, there was only a moderate correlation between the two methods of CAPE-V. Such a difference between live and offline CAPE-V has been reported in previous studies (Solomon et al., 2011). Several factors likely contributed to this discrepancy. In the offline CAPE-V, recorded stimuli were presented over headphones, ratings were performed only on 1000-ms vowel samples, and listeners had limited knowledge about the talkers. In contrast, live CAPE-V may involve clinicians listening to a live voice from a patient or to recordings over headphones or speakers, and ratings are typically "global" judgments across different stimulus types (i.e., vowels /a/ and /i/, six sentences, and connected speech). Furthermore, live CAPE-V ratings are often completed by the same clinician who has a priori knowledge of the patient and their voice disorder diagnosis, including information about age, biological sex, and gender.
Measurement of VQ in standard units was able to accurately capture both the direction and the magnitude of change. There were statistically significant posttreatment improvements in breathiness and roughness, as evidenced by a reduction in scale units (see Figure 5). The clinical significance of this treatment outcome, indexed by Cohen's effect size, was large; a higher Cohen's d value reflects greater practical significance of the observed difference between the paired conditions. This further supports the clinical translation and implementation of the VQ scaling procedure to measure responsiveness to change in dysphonic voices.
Overall, clinicians noted the consistency of functions within the QualEVox software and found it relatively easy to learn and not cumbersome in terms of the time required for VQ evaluations. They were satisfied with the software and would use it frequently in future evaluations. However, the instructions for installation and setup were found to be difficult, and suggestions included a step-by-step tutorial video. Other suggested design changes included the addition of "play" and "previous" buttons and an increase in the duration of the samples to be rated (longer vowels and sentences). Specific to the synthetic anchors, the difference between the lower units (0.125 and 0.25) was perceptually indiscernible. A revision of the program incorporating this feedback, together with a larger clinical trial involving a larger set of stimuli and clinicians across different centers and the addition of a "training" module, would enhance VQ measurement.
General Discussion
The use of auditory-perceptual tools such as the GRBAS (Hirano, 1981) and CAPE-V (Kempster et al., 2009) scales has been pivotal in the measurement of VQ attributes in adults and children with a multitude of voice disorders and in establishing consistency across clinical and research contexts. These tools facilitate comparison of treatment outcomes and ensure reproducibility of research findings. However, the inherent subjectivity of perceptual evaluations remains a challenge. One area of research has focused on building consensus through rater training, training protocols, anchors, and calibration to minimize variability and improve reliability (Barsties & De Bodt, 2015; Chan & Yiu, 2002, 2006; Eadie & Baylor, 2006). Ongoing efforts to improve such scales also note the need for refinement to improve responsiveness to subtle variations and ecological validity (Nagle, 2024, 2025; Solomon et al., 2011). There is some debate regarding the use of natural versus synthesized anchors: Although supporters of natural samples with varying levels of VQ severity highlight the strong resemblance between such anchors and the target dysphonic stimulus to be rated, the parameters of synthesized stimuli can be systematically and individually adjusted. Furthermore, data-driven guidance on the optimal number of anchors is not available.
There are several limitations of this study that are worthy of consideration. First, Experiment 1 reestablished the standard reference units using ME tasks for breathiness and roughness. Although the databases used for the breathy and rough tasks varied in population sampling, elicitation, and recording methods, they offered the advantage of a diverse range of voices and VQ severities. A detailed quality assurance protocol that eliminated features such as peak clipping, background noise, tremor or other frequency/amplitude modulations, and co-occurrence of other qualities such as strain or nasality helped to mitigate those limitations while leveraging the advantages. It is possible that the use of different stimulus sets across studies (not across the experiments here) may result in a slight shift in the fitted logistic function and, thus, in the standard reference units. This challenge could be addressed through comparison of perceived VQ units with objective data from computational modeling, which can handle large data sets in a time-efficient manner and can also be applied to connected speech. Second, in this study, the same set of listeners completed both the breathiness and roughness ME tasks used for fitting the logistic function, ensuring that familiarity and practice effects were consistent across listeners. Third, the current study developed scales for voices varying in only one VQ dimension; however, many dysphonic voices are multidimensional, and future research needs to refine the scaling software and procedures to accurately measure co-occurring and covarying dimensions. Finally, this study focused only on vowels; a logical extension to connected speech stimuli is essential for improving ecological validity and supporting routine clinical use.
Development of standardized scales with desired ratio-level measurement properties for VQ evaluation is critical to enhance the perceptual rating process. The conversion of physical scales of VQ to psychophysical scales is a transformative step in making the proposed measurements operable in research, clinical, and other applications. For example, adoption of standard scales will make it possible to index the precise magnitude of VQ change (such as before or after therapy) and will allow one to validly assess the effect size of treatment. Thus, it would be possible to accurately state that “Treatment A improved VQ by 30%, whereas treatment B resulted in an 18% improvement.” This is currently not possible through conventional approaches because these do not adequately quantify the relationship between underlying physical variables and VQ perception. Data obtained through experiments with standard scales will also allow development of computational models for VQ perception that accurately reflect listener judgment. Computational VQ models that target stable population averages and eliminate risk of poor inter- or intrarater reliability and agreement make VQ measurement more standardized and effective. This will facilitate interpretation and communication among clinicians, scientists, and patients.
Conclusions
The present study established standard reference points for VQ, tested the external validity of VQ dimension ratings specific to breathiness and roughness, and evaluated the psychometric properties of these newly developed scales in a clinical context. In achieving these goals, the present study proposed a rigorous method for improving the perceptual evaluation of voice, increasing accuracy, precision, and application.
Data Availability Statement
The published data are available from the corresponding author upon reasonable request.
Acknowledgments
This work was supported by the National Institute on Deafness and Other Communication Disorders R01DC009029 (awarded to David A. Eddins and Rahul Shrivastav). The authors would like to thank Madison Dyjak for Experiment 1 data collection, Erol Ozmeral for Experiment 1 data analyses and creating QualEVox software for Experiments 2 and 3, and two expert voice clinicians at Medical University of South Carolina who completed perceptual judgments and provided usability data for Experiment 3.
Appendix
Usability Questionnaire Results
| Questions | Clinician 1 ratings | Clinician 2 ratings |
|---|---|---|
| 1. I think that I would like to use the QualEVox program frequently. | 5 | 5 |
| 2. I found the QualEVox program unnecessarily complex. | 2 | 4 |
| 3. I thought the QualEVox program was easy to use. | 4 | 3 |
| 4. I think that I would need the support of a technical person to be able to use the QualEVox program. | 5 | 4 (*initially) |
| 5. I found the various functions of the QualEVox program were well integrated. | 4 | 4 |
| 6. I thought there was too much inconsistency in the QualEVox program. | 1 | 1 |
| 7. I would imagine that most people would learn to use the QualEVox program very quickly. | 4 | 5 |
| 8. I found the QualEVox program very cumbersome to use. | 2 | 4 |
| 9. I felt very confident using the QualEVox program. | 4 | 4 |
| 10. I needed to learn a lot of things before I could get going with the QualEVox program. | 5 | 3 |
| 11. I was able to complete the voice quality evaluations quickly using QualEVox program. | 5 | 5 |
| 12. It took a long time to complete voice quality evaluations through QualEVox program. | 1 | 1 |
| 13. The information provided for QualEVox was easy to understand. | 4 | 3 |
| 14. QualEVox instructions were hard to follow. | 1 | 3 |
| 15. It was easy to install and setup QualEVox. | 1 | 2 |
| 16. QualEVox was hard to install and setup. | 5 | 4 |
| 17. QualEVox screens were easy to understand. | 4 | 4 |
| 18. It was hard to understand the information presented on QualEVox screens. | 1 | 2 |
| 19. Overall, I am satisfied with QualEVox. | 4 | 4 |
| 20. I am not satisfied with QualEVox program. | 1 | 2 |
Note. Ratings are on a scale from 1 (strongly disagree) to 5 (strongly agree).
Responses to the prompt “Please use this space for any additional comments you may have.”
Setting up and opening the QualEVox program was the most difficult part.
I was unable to choose my age/gender in the setup portion.
Usability (once set up and open) was good → pretty easy to learn.
I can imagine that anyone with some basic comfort with computers could readily learn to use QualEVox.
One thing I wish was different was the spacing of the rating numbers.
The jumps from 2 to 4 to 8 were equally spaced on the scale; however, the number of data points between the numbers increased. I can appreciate this causing confusion for some users.
It was difficult to discern the differences between anchors 0.125 and 0.25.
Overall, I felt this was a quick and fairly easy way to judge recordings.
Footnote
The age and sex of talkers with dysphonia used in Experiment 1 for breathiness and roughness (44 talkers each) were not completely available in the databases. For breathiness, the mean age across 31 of the 44 talkers was 57 years (range: 26–80 years; 13 males and 18 females). For roughness, the mean age across 10 of the 44 talkers was 62 years (range: 47–76 years; five males and five females).
References
- American National Standards Institute. (2004). Methods for manual pure-tone threshold audiometry (ANSI S3.21–2004).
- Anand, S. (2023). Perceptual and computational estimates of vocal breathiness and roughness in sustained phonation and connected speech. Journal of Voice, 39(4), 1131.e31–1131.e43. 10.1016/j.jvoice.2023.02.014
- Anand, S., Skowronski, M. D., Shrivastav, R., & Eddins, D. A. (2019). Perceptual and quantitative assessment of dysphonia across vowel categories. Journal of Voice, 33(4), 473–481. 10.1016/j.jvoice.2017.12.018
- Barsties, B., & De Bodt, M. (2015). Assessment of voice quality: Current state-of-the-art. Auris Nasus Larynx, 42(3), 183–188. 10.1016/j.anl.2014.11.001
- Batty, S. V., Howard, D. M., Garner, P. E., Turner, P., & White, A. D. (2002). Clinical pilot study assessment of a portable real-time voice analyser. Logopedics Phoniatrics Vocology, 27(2), 59–62. 10.1080/140154302760409266
- Ben-Zeev, D., Brenner, C. J., Begale, M., Duffecy, J., Mohr, D. C., & Mueser, K. T. (2014). Feasibility, acceptability, and preliminary efficacy of a smartphone intervention for schizophrenia. Schizophrenia Bulletin, 40(6), 1244–1253. 10.1093/schbul/sbu033
- Brooke, J. (1995). SUS: A quick and dirty usability scale. In P. W. Jordan, B. Thomas, I. L. McClelland, & B. Weerdmeester (Eds.), Usability evaluation in industry (pp. 189–194). CRC Press.
- Carding, P. N., Wilson, J. A., MacKenzie, K., & Deary, I. J. (2009). Measuring voice outcomes: State of the science review. The Journal of Laryngology & Otology, 123(8), 823–829. 10.1017/S0022215109005398
- Chan, K. M. K., & Yiu, E. M.-L. (2002). The effect of anchors and training on the reliability of perceptual voice evaluation. Journal of Speech, Language, and Hearing Research, 45(1), 111–126. 10.1044/1092-4388(2002/009)
- Chan, K. M. K., & Yiu, E. M.-L. (2006). A comparison of two perceptual voice evaluation training programs for naive listeners. Journal of Voice, 20(2), 229–241. 10.1016/j.jvoice.2005.03.007
- Desjardins, M., Halstead, L., Cooke, M., & Bonilha, H. S. (2017). A systematic review of voice therapy: What "effectiveness" really implies. Journal of Voice, 31(3), 392.e13–392.e32. 10.1016/j.jvoice.2016.10.002
- Eadie, T. L., & Baylor, C. R. (2006). The effect of perceptual training on inexperienced listeners' judgments of dysphonic voice. Journal of Voice, 20(4), 527–544. 10.1016/j.jvoice.2005.08.007
- Eddins, D. A., Anand, S., Camacho, A., & Shrivastav, R. (2016). Modeling of breathy voice quality using pitch-strength estimates. Journal of Voice, 30(6), 774.e1–774.e7. 10.1016/j.jvoice.2015.11.016
- Eddins, D. A., Anand, S., Lang, A., & Shrivastav, R. (2021). Developing clinically relevant scales of breathy and rough voice quality. Journal of Voice, 35(4), 663.e9–663.e16. 10.1016/j.jvoice.2019.12.021
- Heman-Ackah, Y. D., Michael, D. D., & Goding, G. S., Jr. (2002). The relationship between cepstral peak prominence and selected parameters of dysphonia. Journal of Voice, 16(1), 20–27. 10.1016/S0892-1997(02)00067-X
- Hirano, M. (1981). Clinical examination of voice. Springer.
- Karnell, M. P., Melton, S. D., Childes, J. M., Coleman, T. C., Dailey, S. A., & Hoffman, H. T. (2007). Reliability of clinician-based (GRBAS and CAPE-V) and patient-based (V-RQOL and IPVI) documentation of voice disorders. Journal of Voice, 21(5), 576–590. 10.1016/j.jvoice.2006.05.001
- Kay Elemetrics Corporation. (1994). Disordered voice database, model 4337. Kay Elemetrics.
- Kempster, G. B., Gerratt, B. R., Verdolini Abbott, K., Barkmeier-Kraemer, J., & Hillman, R. E. (2009). Consensus Auditory-Perceptual Evaluation of Voice: Development of a standardized clinical protocol. American Journal of Speech-Language Pathology, 18(2), 124–132. 10.1044/1058-0360(2008/08-0017)
- Kempster, G. B., Nagle, K. F., & Solomon, N. P. (2025). Development and rationale for the Consensus Auditory-Perceptual Evaluation of Voice—Revised (CAPE-Vr). Journal of Voice. Advance online publication. 10.1016/j.jvoice.2025.01.022
- Kreiman, J., & Gerratt, B. R. (2000). Sources of listener disagreement in voice quality assessment. The Journal of the Acoustical Society of America, 108(4), 1867–1876. 10.1121/1.1289362
- Marks, L. E., & Florentine, M. (2010). Measurement of loudness, part I: Methods, problems, and pitfalls. In M. Florentine, A. Popper, & R. Fay (Eds.), Loudness (pp. 17–56). Springer. 10.1007/978-1-4419-6712-1_2
- Nagle, K. F. (2016). Emerging scientist: Challenges to CAPE-V as a standard. Perspectives of the ASHA Special Interest Groups, 1(3), 47–53. 10.1044/persp1.SIG3.47
- Nagle, K. F. (2025). Clinical use of the CAPE-V scales: Agreement, reliability and notes on voice quality. Journal of Voice, 39, 685–698. 10.1016/j.jvoice.2022.11.014
- Nagle, K. F., Kempster, G. B., & Solomon, N. P. (2024). Survey of voice-focused speech-language pathologists' usage of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V). Journal of Voice. Advance online publication. 10.1016/j.jvoice.2024.08.032
- Nemr, K., Simoes-Zenari, M., Cordeiro, G. F., Tsuji, D., Ogawa, A. I., Ubrig, M. T., & Menezes, M. H. M. (2012). GRBAS and CAPE-V scales: High reliability and consensus when applied at different times. Journal of Voice, 26(6), 812.e17–812.e22. 10.1016/j.jvoice.2012.03.005
- Park, Y., Anand, S., Gifford, S. M., Shrivastav, R., & Eddins, D. A. (2023). Development and validation of a single-variable comparison stimulus for matching strained voice quality using a psychoacoustic framework. Journal of Speech, Language, and Hearing Research, 66(1), 16–29. 10.1044/2022_JSLHR-22-00280
- Park, Y., Anand, S., Ozmeral, E. J., Shrivastav, R., & Eddins, D. A. (2022). Predicting perceived vocal roughness using a bio-inspired computational model of auditory temporal envelope processing. Journal of Speech, Language, and Hearing Research, 65(8), 2748–2758. 10.1044/2022_JSLHR-22-00101
- Patel, S., Shrivastav, R., & Eddins, D. A. (2010). Perceptual distances of breathy voice quality: A comparison of psychophysical methods. Journal of Voice, 24(2), 168–177. 10.1016/j.jvoice.2008.08.002
- Patel, S., Shrivastav, R., & Eddins, D. A. (2012a). Developing a single comparison stimulus for matching breathy voice quality. Journal of Speech, Language, and Hearing Research, 55(2), 639–647. 10.1044/1092-4388(2011/10-0337)
- Patel, S., Shrivastav, R., & Eddins, D. A. (2012b). Identifying a comparison for matching rough voice quality. Journal of Speech, Language, and Hearing Research, 55(5), 1407–1422. 10.1044/1092-4388(2012/11-0160)
- Shrivastav, R., Sapienza, C. M., & Nandur, V. (2005). Application of psychometric theory to the measurement of voice quality using rating scales. Journal of Speech, Language, and Hearing Research, 48(2), 323–335. 10.1044/1092-4388(2005/022)
- Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 10.1037/0033-2909.86.2.420
- Solomon, N. P., Helou, L. B., & Stojadinovic, A. (2011). Clinical versus laboratory ratings of voice using the CAPE-V. Journal of Voice, 25(1), e7–e14. 10.1016/j.jvoice.2009.10.007
- Stevens, S. S. (1936). A scale for the measurement of a psychological magnitude: Loudness. Psychological Review, 43(5), 405–416. 10.1037/h0058773
- Zraick, R. I., Kempster, G. B., Connor, N. P., Thibeault, S., Klaben, B. K., Bursac, Z., & Glaze, L. E. (2011). Establishing validity of the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V). American Journal of Speech-Language Pathology, 20(1), 14–22. 10.1044/1058-0360(2010/09-0105)