Author manuscript; available in PMC: 2026 May 9.
Published before final editing as: J Voice. 2025 May 9:S0892-1997(25)00155-9. doi: 10.1016/j.jvoice.2025.04.006

Automated analysis of relative fundamental frequency in continuous speech: Development and comparison of three processing pipelines

Mark Berardi 1,2,*, Erin Tippit 3, Yixiang Gao 4, Guilherme N DeSouza 4, Maria Dietrich 2,3
PMCID: PMC12947873  NIHMSID: NIHMS2081360  PMID: 40348688

Abstract

Objectives:

Relative fundamental frequency (RFF) estimates laryngeal tension during speech, providing insights into vocal effort. Current methods to derive RFF from continuous speech require manual processing, hindering large-scale studies with ecologically valid speech productions. This research aimed to develop and evaluate three fully automated pipelines for RFF analysis from continuous speech, addressing this limitation.

Methods:

Three pipelines were compared: two modifications of an existing semi-automated approach (aRFF-AP) and one novel pipeline replicating manual analysis. The pipelines were tested on speech samples containing vowel-consonant-vowel (VCV) utterances from 82 female participants with and without vocal fatigue complaints in the absence of phonotraumatic vocal fold changes. The pipelines automatically segmented VCVs and measured RFF. Manual measurements of a subset provided reliability and validity benchmarks.

Results:

All pipelines demonstrated good reliability (r ≥ 0.84) and validity when compared to manual analysis. They required minimal manual correction (< 4%) for fricative identification. Notably, the novel aRFF-B pipeline rejected the fewest samples (10–25%) while maintaining reliability and was able to leverage parallel computing.

Conclusions:

Three automated pipelines, especially aRFF-B, enabled time-efficient RFF analysis of large continuous speech data sets without manual intervention. This advancement can facilitate large-scale studies using RFF applied to continuous speech, potentially expanding its application in voice research and clinical practice.

Keywords: vocal effort, speech processing, voice assessment

Introduction

Relative fundamental frequency (RFF) is a temporal-based acoustic measure that estimates laryngeal tension.1 This measure has been used to show changes in vocal effort in both vocally healthy populations2,3 and populations with voice disorders such as Parkinson’s Disease,4 laryngeal dystonia,5 and vocal hyperfunction in individuals with and without phonotrauma.6–8 RFF measures the semitone change of glottal pulses during the transitions from a vowel to a voiceless consonant and then from the consonant to another vowel (e.g., /afa/). While RFF is traditionally measured manually by a trained technician using the software Praat to identify the glottal periods of the speech signal,9 this approach becomes impractical for large-scale analyses.

Research on vocal effort has historically been limited by the lack of reliable, objective measures that can be efficiently applied to large data sets. While various acoustic and aerodynamic parameters have been proposed to quantify vocal effort,3,10 many require specialized equipment or labor-intensive manual analysis that makes large-scale studies impractical. This has hindered understanding of how vocal effort manifests across different speaking conditions and populations. Furthermore, recent application in ambulatory voice monitoring using neck-surface accelerometers11 has raised the possibility that RFF could serve as an objective biomarker of ecological vocal effort in daily life. Thus, the development of automated measurement techniques for parameters like RFF represents a critical step toward more comprehensive investigations of vocal effort mechanisms and their clinical applications within speech-language pathology and related disciplines.

Previous work demonstrates that RFF calculations can be automated.12,13 This algorithm (referred to as aRFF-AP) is a MATLAB (MathWorks, Natick, MA) script for semi-automated RFF analysis. However, this algorithm requires manual supervision to confirm the accurate identification of the voiceless consonant and is only validated for uniform vowel-consonant-vowel (VCV) utterances (e.g., /afa/; algorithm details can be found in Vojtech et al.12). While VCV utterances offer the advantages of lower intra-subject variability and simplified computation,9 a review of 37 articles [published through September 2021] reported that 43% of RFF studies used continuous speech stimuli.14 These studies utilized various continuous speech stimuli, including the Rainbow Passage,15 CAPE-V sentences,16 and task-specific stimuli like a Stroop task.2,17 However, all relied on manual analysis due to the lack of automated algorithms for continuous speech.

The limitations of the current semi-automated approach become particularly apparent in large-scale analyses. Omotara et al.18 found that 43% of the audio files from 92 female participants required manual intervention for fricative identification in uniform VCV utterances. This high rate of required human intervention makes the existing approach unsuitable for large-scale studies. Modern voice analysis increasingly relies on machine learning approaches,19 which require large datasets to develop robust, generalizable models. The current need for extensive manual intervention prevents the creation of such datasets. Furthermore, continuous speech provides greater ecological validity than VCV combinations, highlighting the need for automated RFF analysis capabilities in natural speech contexts.

The purpose of this study was to develop and evaluate automated computation pipelines for RFF analysis in continuous speech that could be used with large datasets. Here three new pipelines were developed: two modifications of the existing semi-automated approach (aRFF-AP12) and one novel procedure designed to replicate manual analysis. These pipelines were evaluated against manual analysis of a subset of samples through measures of reliability and validity.

Methods

Analysis Pipeline Overview

The RFF analysis pipelines consisted of five main stages: (1) acquisition of acoustic data, (2) VCV segmentation, (3) fricative identification, (4) glottal pulse measurement, and (5) RFF calculation (see Figure 1).

Figure 1.

Visual representation of the analysis procedures for the three pipelines for automated relative fundamental frequency (aRFF) and one pipeline for manual analysis: aRFF-A1, modified pipeline of existing aRFF-AP12 with no change to fricative identification; aRFF-A2, modified from aRFF-AP with PFA/HTK fricative identification; aRFF-B, novel pipeline; mRFF, manual analysis. These pipelines were implemented using the following algorithms: Penn Forced Aligner (PFA) with the Hidden Markov Speech Recognition Toolkit (HTK); Praat: Extract pulses, modified semi-automated script (aRFF-A0 [modified from aRFF-AP12]), and the novel automated script (aRFF-B). The implementation also utilized various software, including: (a) Audacity (v. 2.4.2), (b) Python (v. 3.10.0), (c) MATLAB (v. 2021a), (d) Praat (v. 6.1.12), and (e) Microsoft Excel (Microsoft 365).

Acquisition of acoustic data

A variety of speech samples can be used for RFF measurement. Here, bespoke sentences were read by the participants and recorded with a microphone following standards for acoustic recordings.

Participants

Following informed consent (University of Missouri IRB), 82 female participants with and without vocal fatigue based on the Vocal Fatigue Index20 and without phonotraumatic changes of their vocal folds per laryngeal videostroboscopy were included in this study. Nine of these participants (student teachers or early career teachers) had repeat sessions, resulting in 91 total data sets. Participants’ ages ranged from 21–39 years (M = 24.3, SD = 3.3). Mean VFI scores were 7.6 (SD = 5.9, range 0–28), 2.4 (SD = 2.5, range 0–12), and 5.6 (SD = 3.8, range 0–12) for parts 1, 2, and 3 of the VFI, respectively. For more information about inclusion and exclusion criteria and procedures, see Gao et al.21

Speech samples

The participants were recorded with a head-set microphone (AKG C520, Vienna, Austria) placed 4 cm from the corner of the mouth at a 45-degree angle. Recordings were made at 44.1 kHz with 16-bit resolution in a sound isolation booth.

Each participant repeatedly read two sentences, which were previously used in RFF measurements from continuous speech.9 They read each sentence 55 times as part of a larger study. The sentence productions were preceded by repetitions of the vowels /a/, /i/, and /u/, 55 times each. Each sentence contained three target VCV utterances for RFF measurement. The first sentence was “The dew shimmered over my shiny blue shell again.” with the target fricative /ʃ/, and the second sentence was “Only we feel you do fail in new fallen dew.” with the target fricative /f/. From these sentences, there was a potential of 55 (trials per session) × 3 (instances per sentence) × 2 (sentences) = 330 RFF instances per participant.

VCV segmentation

This stage isolated and segmented the specific VCV utterances within the speech samples that were needed for RFF analysis. For the automated pipelines, the Penn Forced Aligner with Hidden Markov Model Toolkit (PFA/HTK) module was used (see below for details). For manual analysis, this was done using an open-source audio editor Audacity (v. 2.4.2 Audacity Team).

Fricative identification

In this stage, the algorithms locate and mark the boundary of the voiceless fricative (/f/ or /ʃ/) within the VCV sequence. This is used to separate the offset and onset portions of the utterance.

Glottal pulse measurement

The glottal pulse measurement identifies the individual cycles of vocal fold vibration before and after the fricative. The pulses preceding the fricative are offset pulses as they represent the termination of voicing because of vocal fold abduction for the voiceless consonant. The pulses following the fricative are onset pulses as they represent the beginning of phonation from vocal fold adduction (see Figure 2).

Figure 2.

Number of offset and onset cycles used in relative fundamental frequency (RFF) measurement from a vowel-consonant-vowel utterance from continuous speech. The offset glottal pulses precede the consonant, and the onset glottal pulses follow the consonant.

RFF calculation

RFF is calculated as the semitone difference between the instantaneous frequency of each of the 10 glottal cycles surrounding a voiceless fricative and a steady-state frequency, which is taken from the glottal cycle ten cycles away from the fricative (this steady-state cycle is itself one of the 10 cycles). This calculation can be expressed as:

RFF_n = 12 × log2(f_n / f_ss)    (1)

where n is the glottal cycle number from 1 to 10 with fn as the instantaneous frequency of the nth cycle and fss is the instantaneous frequency of the steady-state cycle. For offset RFF, the steady-state frequency is from cycle 1, while for onset RFF the steady-state frequency is from cycle 10 (see Figure 2). This calculation was computed with MATLAB (v. 2021a) or Excel (Office 365, Microsoft Corporation, Redmond, WA) depending on the pipeline.
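For illustration, Eq. (1) can be sketched in a few lines of Python (a minimal, hypothetical implementation; the actual pipelines used MATLAB or Excel, and the function and variable names here are our own). Instantaneous frequencies are the reciprocals of the pulse-to-pulse periods, and each cycle is expressed in semitones relative to the steady-state cycle:

```python
import math

def rff_semitones(pulse_times, steady_state_index):
    """Eq. (1) sketch: RFF_n = 12 * log2(f_n / f_ss).

    pulse_times: 11 consecutive glottal pulse times (s) on one side of
    the fricative, giving 10 cycles. steady_state_index: 0 for offset RFF
    (cycle 1 is steady state), 9 for onset RFF (cycle 10 is steady state).
    """
    # Instantaneous frequency of each cycle is the reciprocal of its period.
    freqs = [1.0 / (t2 - t1) for t1, t2 in zip(pulse_times, pulse_times[1:])]
    f_ss = freqs[steady_state_index]
    return [12.0 * math.log2(f_n / f_ss) for f_n in freqs]

# Hypothetical offset token: a steady 200 Hz vowel whose final two cycles
# slow toward the fricative (190 Hz, then 180 Hz).
periods = [1 / 200] * 8 + [1 / 190, 1 / 180]
times = [0.0]
for p in periods:
    times.append(times[-1] + p)
offset_rff = rff_semitones(times, steady_state_index=0)
```

By construction, the steady-state cycle evaluates to 0 ST, and the slowing cycles near the fricative come out negative, matching the expected drop in fundamental frequency at voicing offset.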

Analysis Modules

Multiple analysis modules were used across the pipelines. Here each module is described in terms of its function and role in the analysis stages. Not all modules were used in all pipelines.

aRFF-AP

This routine is an existing semi-automated RFF-analysis script and was not directly used in this study due to its need for manual intervention. Additionally, the default settings of this script require specific speech samples with three uniform utterances (e.g., /afa afa afa/) not embedded in sentences. This routine can process two analysis stages: (3) fricative identification and (4) glottal pulse measurement. The script is implemented within MATLAB and is available online (for more details, see Vojtech et al.12).

aRFF-A0

This routine is a modification of aRFF-AP that removes the manual intervention step and adjusts the input parameter of number of processed VCV utterances from three to one (e.g. /afa/). This routine was implemented in MATLAB and processed two analysis stages: (3) fricative identification and (4) glottal pulse measurement.

PFA/HTK

This is an analysis module used to align speech to text for automated VCV segmentation. To do this, the text of the sentence and the audio sample are provided to the Penn Forced Aligner (PFA22), which uses the Hidden Markov Speech Recognition Toolkit (HTK23); the script produces a TextGrid with time points for each word and each speech sound within the sentence. This process is implemented in a UNIX shell environment. In the present study, the TextGrid from the PFA script was processed with the audio file to segment each of the target word pairs with a VCV utterance for RFF analysis. This processing was implemented in MATLAB and used in the analysis stages of (2) VCV segmentation and (3) fricative identification.
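As an illustration of this stage, the sketch below scans a phone-level alignment, such as could be read from a PFA-produced TextGrid, for vowel–fricative–vowel triplets. The TextGrid parsing itself is omitted, and the phone labels, vowel set, and function names are illustrative assumptions, not the study's implementation:

```python
def find_vcv_spans(phones, target_fricatives=("F", "SH")):
    """Locate vowel-fricative-vowel spans in a phone alignment.

    phones: list of (start_s, end_s, label) tuples, as could be read from
    a forced-alignment TextGrid (parsing omitted). Returns the time span
    of each vowel-fricative-vowel triplet, used to cut out the VCV audio
    for RFF analysis.
    """
    vowels = {"AA", "IY", "UW", "EH", "AY", "AO"}   # illustrative subset
    def is_vowel(lbl):
        return lbl.rstrip("012") in vowels          # strip stress digits
    spans = []
    for (s0, e0, p0), (s1, e1, p1), (s2, e2, p2) in zip(phones, phones[1:], phones[2:]):
        if p1 in target_fricatives and is_vowel(p0) and is_vowel(p2):
            spans.append((s0, e2, (p0, p1, p2)))
    return spans

# "...blue shell..." aligned as B, UW1, SH, EH1, L (hypothetical times):
phones = [(0.00, 0.10, "B"), (0.10, 0.25, "UW1"),
          (0.25, 0.38, "SH"), (0.38, 0.52, "EH1"), (0.52, 0.60, "L")]
spans = find_vcv_spans(phones)
```

Each returned span gives the start of the preceding vowel and the end of the following vowel, which is the region the pipelines cut from the sentence audio.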

Since this module has not previously been used for (3) fricative identification, a manual validation was conducted. A subset of 10% of the data (3,003 samples) was manually checked (E.T.) for errors. This process included visual inspection of the waveform with the machine-processed fricative location marked on the waveform and was accompanied by the audio, which was playable for the rater. The rater judged whether the fricative identification was correct or not. This process was implemented in MATLAB. The samples were checked twice, once with the fricative identification from the preexisting algorithm aRFF-A0 and again with the PFA/HTK module approach. To be included in the pipeline for (3) fricative identification, the PFA/HTK module needed to produce equivalent or fewer errors than the aRFF-A0 algorithm.

Praat

Praat is a software platform for phonetic analysis.24 This software has been previously used for the manual analysis of RFF.9 Praat was used in two analysis stages: (3) fricative identification and (4) glottal pulse measurement.

Analysis Pipelines

Three automated analysis pipelines and one manual pipeline were used. All pipelines used the same process for (1) acquisition of acoustic data as is described in the Speech Samples section. See Figure 1 for a summary of the analysis pipelines.

aRFF-A1

This pipeline used the PFA/HTK module for (2) VCV segmentation, aRFF-A0 for (3) fricative identification and (4) glottal pulse measurement, and MATLAB for (5) RFF calculation.

aRFF-A2

This pipeline used the PFA/HTK module for (2) VCV segmentation and (3) fricative identification, aRFF-A0 for (4) glottal pulse measurement, and MATLAB for (5) RFF calculation.

aRFF-B

This pipeline used the PFA/HTK module for (2) VCV segmentation and (3) fricative identification. For this pipeline, the process (4) glottal pulse measurement was newly developed with the intent to replicate and automate the tools and processes technicians use to manually measure RFF. The glottal pulse measurement was completed using a command-line version of Praat through system commands operated by MATLAB, which allows for batch automation and a singular analysis platform. Praat was configured to use its cross-correlation algorithm with a fundamental frequency range of 75 Hz to 500 Hz and used the “To PointProcess (periodic, cc)” function.

A multiple windowing approach was implemented to account for discrepancies in measured pulses between analysis windows. The pulse measurement utilized repeated sampling with a sliding rectangular window, which was initially sized to contain the entire fricative and 20 pulses of the vowel. For the onset vowel, the window moved from the start of the fricative to 20 milliseconds before the end of the fricative at a rate of 100 Hz. Window size and thresholds for the multiple-window pulse identification process were calibrated by comparison to an existing RFF training set25 until the algorithm achieved a minimum reliability of 0.93 in agreement with the benchmark of the training set. Based on this validation, pulses that were contained in at least 30% of the windows were designated as true pulses.
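The voting step of this multiple-windowing approach can be illustrated as follows. This is a simplified sketch: the per-window pulse detection via Praat is not shown, and the time-matching tolerance used to decide that two window estimates refer to the same pulse is our own assumption:

```python
from collections import Counter

def vote_pulses(window_pulse_lists, n_windows, keep_fraction=0.30, tol=0.0005):
    """Merge pulse-time estimates from overlapping analysis windows.

    window_pulse_lists: per-window lists of pulse times in seconds (in the
    pipeline these come from repeated Praat "To PointProcess (periodic, cc)"
    calls; detection itself is not shown). A pulse is kept as "true" when it
    appears in at least keep_fraction of the windows; times within tol of
    each other are treated as the same pulse by snapping onto a grid.
    """
    snap = lambda t: round(t / tol) * tol        # quantize times for matching
    counts = Counter(snap(t) for pulses in window_pulse_lists for t in pulses)
    return sorted(t for t, c in counts.items() if c >= keep_fraction * n_windows)

# Four hypothetical windows over the same vowel region:
windows = [[0.010, 0.015],
           [0.010, 0.015, 0.020],
           [0.0102, 0.020],      # slightly offset estimate of the 0.010 pulse
           [0.031]]              # spurious detection, seen only once
pulses = vote_pulses(windows, n_windows=4)
```

With four windows and a 30% threshold, a pulse must appear in at least two windows to survive, so the spurious single detection is discarded.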

With the measured glottal pulses, aRFF-B used MATLAB for (5) RFF calculation.

mRFF

This pipeline used manual analysis for (3) fricative identification, (4) glottal pulse measurement, and (5) RFF calculation. The analysis procedure followed instructions from Vojtech and Heller Murray.25 Three trained RFF technicians (E.T., A.H., M.P.) measured RFF from a subsample of 15% of the total samples to measure the reliability between the automated approaches and manual analysis. The raters all completed the tutorial for manual RFF estimation using Praat and passed the training set.25

Rejection Criteria

Each pipeline employed specific criteria to reject RFF instances that did not meet quality standards for reliable and valid measurement. These rejection rules ensured data quality while balancing the need to retain sufficient samples for analysis.

Pre-pipeline rejection

Individual RFF instances were rejected if they contained misarticulations.

aRFF-A1 and aRFF-A2

The two pipelines aRFF-A1 and aRFF-A2 used unmodified rejection criteria from aRFF-AP,12 which include (1) too few periodic cycles (< 10), (2) failure to reach steady state or unstable, (3) glottalized, (4) aperiodic or irregular, and (5) sharp transition (for more information about these criteria, see Vojtech et al.12). Additionally, individual instances were rejected if they caused an execution fault while running the algorithm.

aRFF-B

The rejection criteria for aRFF-B were developed through a systematic process aimed at replicating manual analysis performance. Using the aforementioned RFF training set from the Stepp Lab,25 criteria were added iteratively until the automated system achieved a reliability score of 0.93—the established benchmark for manual analysis training.

The development process incorporated rejection criteria from multiple sources. The foundation came from Lien and Stepp,13 who established criteria based on Praat signal processing characteristics. These initial criteria rejected samples when either the variance of pulse periods exceeded 2.9×10–6, or when pulse periods of the nine non-adjacent cycles (for offset RFF: cycles 1–9; for onset RFF: cycles 2–10) were 50% longer than the 65th percentile of all pulse periods.

Additional criteria from Vojtech and Heller Murray25 were then incorporated: samples were rejected if they contained fewer than ten glottal cycles or failed to reach steady state, with steady state failure defined as a penultimate RFF greater than 0.8 semitones (for offset RFF: cycle 2; for onset RFF: cycle 9).

To achieve the target reliability of 0.93, two final criteria were added: rejection of samples with phoneme gaps exceeding 250 ms, and rejection of samples showing vocal fry, identified by an average fundamental frequency below 100 Hz.
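Taken together, the criteria above can be sketched as a single screening function. This is a minimal illustration under stated assumptions: the function and variable names are our own, the 65th-percentile estimate is crude, and the steady-state check is interpreted as an absolute-value test on the penultimate RFF:

```python
import statistics

def reject_reason(periods, rff, phoneme_gap_s, offset=True):
    """Screen one RFF instance against the aRFF-B criteria described above.

    periods: the 10 glottal cycle periods (s); rff: the 10 RFF values (ST);
    phoneme_gap_s: gap between the phonemes around the fricative.
    Returns the first matching rejection reason, or None to accept.
    """
    if len(periods) < 10:
        return "too few cycles"                 # fewer than 10 glottal cycles
    if phoneme_gap_s > 0.250:
        return "long pause"                     # phoneme gap over 250 ms
    if len(periods) / sum(periods) < 100.0:
        return "vocal fry"                      # mean f0 below 100 Hz
    if statistics.pvariance(periods) > 2.9e-6:
        return "high period variance"
    # Non-adjacent cycles (offset: 1-9, onset: 2-10) 50% longer than the
    # 65th percentile of all pulse periods.
    p65 = sorted(periods)[int(0.65 * len(periods))]   # crude 65th percentile
    non_adjacent = periods[:9] if offset else periods[1:]
    if any(p > 1.5 * p65 for p in non_adjacent):
        return "period outlier"
    penultimate = rff[1] if offset else rff[8]  # offset: cycle 2; onset: cycle 9
    if abs(penultimate) > 0.8:
        return "no steady state"                # penultimate RFF over 0.8 ST
    return None
```

For example, a clean 200 Hz token (periods of 5 ms, near-zero RFF, short gap) passes every check, while the same token slowed to roughly 83 Hz is flagged as vocal fry.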

Pipeline Comparison

To evaluate the relative performance of the three automated pipelines and manual analysis, pairwise comparisons were conducted between all analysis methods (aRFF-A1, aRFF-A2, aRFF-B, and mRFF) using Pearson’s correlation coefficients (where values ≥ 0.80 indicate strong reliability). To quantify the magnitude of measurement differences between methods, the root mean square error (RMSE) was calculated. RMSE represents the average magnitude of the errors between paired measurements in semitones, with lower values indicating closer agreement between methods. To ensure robust comparisons and reduce measurement variability, RFF values were averaged across nine utterance instances—exceeding the six instances recommended by Eadie and Stepp.5 These nine instances were selected from the beginning (trials 1–3), middle (trials 27–29), and end (trials 53–55) of each participant’s recorded sentence repetitions.
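For reference, the two comparison metrics reduce to a few lines of code. This is a self-contained sketch with hypothetical paired values; the study's actual statistics were computed on the participant-averaged RFF data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired RFF measurements."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root mean square error (ST) between paired measurements."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical paired offset-cycle-10 values (ST) from two methods:
auto   = [-1.2, -0.8, -1.5, -0.3, -1.0]
manual = [-1.0, -0.9, -1.4, -0.5, -1.1]
r_val = pearson_r(auto, manual)
err = rmse(auto, manual)
```

A correlation at or above 0.80 would count as strong reliability under the criterion above, while RMSE expresses the typical disagreement directly in semitones.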

Results

Overview of Pipeline Performance

The three automated pipelines (aRFF-A1, aRFF-A2, and aRFF-B) were evaluated against manual analysis (mRFF) across four performance dimensions: accuracy of fricative identification, sample rejections, measurement reliability, and processing efficiency.

Accuracy of Fricative Identification

Manual validation of fricative identification was conducted by a trained RFF technician (E.T.) on a random 10% subset of the total available data. The aRFF-A0 algorithm used in pipeline aRFF-A1 required manual intervention in 4% of samples, while the PFA/HTK module used in the aRFF-A2 and aRFF-B pipelines needed intervention in only 1% of samples.

Sample Rejection Results

Rejection rates by pipeline

For aRFF-A1, 31% of offset samples and 47% of onset samples were rejected. The aRFF-A2 pipeline demonstrated moderately lower rejection rates, with 25% of offset samples and 43% of onset samples rejected. The aRFF-B pipeline exhibited the lowest rejection rates overall, with 10% of offset samples and 25% of onset samples rejected. The manual analysis (mRFF) rejected 3% of offset samples and 11% of onset samples. See Figure 3 for a summary of the rejection rates for each pipeline.

Figure 3.

Percent of rejected samples from each of the proposed automated analysis procedures by relative fundamental frequency (RFF).

Rejection criteria distribution

aRFF-A1

Of the rejected offset samples for aRFF-A1, 39% were rejected for too few periodic cycles, none were rejected for failures to reach steady state or unstable, 6% were rejected for glottalization, none were rejected for aperiodic or irregular pulses, 54% were rejected for sharp transitions, and 1% were rejected for process failure. Of the rejected onset samples, 34% were rejected for too few periodic cycles, none were rejected for failures to reach steady state or unstable, 14% were rejected for glottalization, none were rejected for aperiodic or irregular pulses, 51% were rejected for sharp transitions, and 1% were rejected for process failure. See Figure 4 for a visual representation of the rejection distribution for pipelines aRFF-A1 and aRFF-A2.

Figure 4.

Distribution of rejection criteria for automated pipelines aRFF-A1 and aRFF-A2 by relative fundamental frequency (RFF) offset and onset. For comparison between the two procedures, the percent of total samples that were rejected was plotted.

aRFF-A2

Of the rejected offset samples for aRFF-A2, 15% were rejected for too few periodic cycles, none were rejected for failures to reach steady state or unstable, 8% were rejected for glottalization, none were rejected for aperiodic or irregular pulses, 76% were rejected for sharp transitions, and 1% were rejected for process failure. Of the rejected onset samples, 21% were rejected for too few periodic cycles, none were rejected for failures to reach steady state or unstable, 18% were rejected for glottalization, none were rejected for aperiodic or irregular pulses, 61% were rejected for sharp transitions, and 1% were rejected for process failure. See Figure 4 for a visual representation of the rejection distribution in comparison to aRFF-A1.

aRFF-B

Of the rejected offset samples for aRFF-B, 21% were rejected for failure to reach steady-state (penultimate RFF being greater than 0.8 ST), 29% were rejected for high variance in the periods (variance was greater than 2.9×10–6), 2% were rejected for a long pause between vowels (greater than 250-millisecond pause between phonemes), 7% were rejected for not enough pulses (fewer than 10 cycles), 2% were rejected for adjacent cycle (cycle 10) being an outlier (pulse period was 50% longer than 65th percentile of other pulse periods), 28% were rejected for non-adjacent cycles (cycles 1–9) containing an outlier, and 11% were rejected for vocal fry. Of the rejected onset samples, 13% were rejected for failure to reach steady-state, 23% were rejected for high variance in the periods, 2% were rejected for a long pause between vowels, 4% were rejected for not enough pulses, 3% were rejected for the adjacent cycle (cycle 1) being an outlier, 4% were rejected for non-adjacent cycles (cycles 2–10) containing an outlier, and 51% were rejected for vocal fry. See Figure 5 for a visual representation of the rejection distribution.

Figure 5.

Distribution of rejection criteria for the novel automated pipeline aRFF-B by relative fundamental frequency (RFF) offset and onset. For offset RFF, the adjacent cycle is cycle 10, while the non-adjacent cycles are 1–9. For onset RFF, the adjacent cycle is cycle 1, while the non-adjacent cycles are 2–10.

Reliability and Measurement Accuracy

Reliability Coefficients

All pipelines demonstrated strong reliability with the manual analysis. The aRFF-A1 pipeline achieved a reliability coefficient of r = .85 (p < .05) with manual analysis, while aRFF-A2 showed reliability at r = .86 (p < .05). The aRFF-B pipeline demonstrated comparable reliability at r = .84 (p < .05). Inter-pipeline reliability was particularly strong between aRFF-A1 and aRFF-A2 at r = .98 (p < .05), while aRFF-A1 and aRFF-B showed good agreement with r = .87 (p < .05).

Measurement Error

Root mean square error (RMSE, in semitones) values were calculated for the offset cycle 10 and onset cycle 1 across all pipelines. These specific cycles were chosen due to their relevance in the literature as the most related to vocal effort.26 The aRFF-A1 pipeline showed RMSE values of 0.71 [95% CI: 0.62–0.80] for offset 10 and 1.24 [95% CI: 1.01–1.47] for onset 1 measurements. The aRFF-A2 pipeline demonstrated slightly lower error with values of 0.68 [95% CI: 0.60–0.76] for offset 10 and 1.18 [95% CI: 0.95–1.35] for onset 1. The aRFF-B pipeline showed comparable offset error at 0.68 [95% CI: 0.57–0.80] but slightly higher onset error at 1.33 [95% CI: 1.08–1.58]. ANOVA testing revealed no significant differences between algorithms for either offset 10 (F = 0.90, p = 0.41) or onset 1 (F = 2.46, p = 0.09) measurements. See Figure 6 for a visualization of RMSE for each cycle for each pipeline. Additionally, the overall RMSE (averaged across all cycles) for aRFF-A1 was 0.46 [95% CI: 0.29, 0.63] ST, for aRFF-A2 was 0.45 [95% CI: 0.28, 0.62] ST, and for aRFF-B was 0.46 [95% CI: 0.30, 0.63] ST.

Figure 6.

RMSE (semitones) of the automated pipelines aRFF-A1, aRFF-A2 and aRFF-B. Error bars represent the standard error of the measurement.

Sample Usability

From the intended nine instances per measurement, the automated pipelines showed varying levels of usable samples. The aRFF-A1 pipeline averaged 6.9 offset and 5.5 onset usable instances, while aRFF-A2 showed slight improvement with 7.2 offset and 5.7 onset usable instances. The aRFF-B pipeline demonstrated better performance with 7.5 offset and 6.7 onset usable instances. Manual analysis (mRFF) achieved the highest usability with 8.8 offset and 8.2 onset instances. Table 1 summarizes the average number of usable utterances for each model.

Table 1.

Average number of usable RFF instances from a total of nine possible instances.

Model Offset Onset

aRFF-A1 6.9 5.5
aRFF-A2 7.2 5.7
aRFF-B 7.5 6.7
mRFF 8.8 8.2

Processing Efficiency

Processing time varied across the pipelines, with aRFF-A1 averaging 700 ms per sample, aRFF-A2 requiring 746 ms, and aRFF-B taking 859 ms. Notably, aRFF-B’s pulse segmentation algorithm supported parallel processing through MATLAB’s “parfor” function, allowing processing time to scale with available processors, potentially offsetting its longer base processing time in multi-processor environments. All processing times were measured on the same computer running Windows 11, with a 32-core processor (AMD Ryzen Threadripper PRO 5975WX; 3.60 GHz) and 192 GB of RAM.
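The parfor-style fan-out can be approximated in other environments. The sketch below (hypothetical, in Python, with our own function names and a dummy per-instance computation) distributes independent VCV instances across a worker pool; because the real per-instance work is an external command-line Praat call, each worker mostly waits on a subprocess, so even a thread pool yields wall-clock speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_instance(span):
    """Stand-in for launching command-line Praat on one VCV segment
    (the real pipeline shells out to Praat; here we just return the
    segment duration as a dummy result)."""
    start_s, end_s = span
    return round(end_s - start_s, 2)

def run_parallel(spans, workers=4):
    # Each VCV instance is independent (as with MATLAB's parfor), so
    # instances can be fanned out across workers and total wall time
    # scales with the number of available workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_instance, spans))

results = run_parallel([(0.0, 0.4), (1.2, 1.7), (2.5, 2.9)])
```

For CPU-bound work done inside the interpreter itself, a process pool would be needed instead; for subprocess-bound work like batch Praat calls, threads suffice.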

Discussion

In order to improve the efficiency of analyzing large datasets of RFF samples from continuous speech, three new automated analysis pipelines were proposed and compared to manual analysis of a subset of the speech samples. The number of samples needing manual intervention for fricative identification was lower for aRFF-B and aRFF-A2 (1%) as compared to aRFF-A1 (4%). All three approaches required manual intervention for far fewer utterances than the 43% reported in a corroborating study that used productions of uniform VCV utterances from the same participant pool as the present study.18 The main difference between these data sets was, however, the type of speech sample: here, VCV utterances were taken from sentences designed for RFF analysis, while the other study used isolated uniform VCV utterances. One possible explanation for this difference in fricative identification accuracy is that the isolated VCV utterances contained noise peaks following the offset of the vowel, caused by bursts of air on the microphone from an overemphasized fricative production. This is a potential advantage of using continuous speech samples for RFF analysis, as these signals contain fewer such bursts due to the natural quality of ordinary speech.

The newly adapted and developed automated approaches performed comparably to manual analysis in terms of reliability and validity, with correlation coefficients exceeding 0.80. Overall RMSE values were consistent across the three pipelines. These are higher than the previously reported RMSE on uniform utterances using aRFF-A0,12 a difference that could result from the present study’s use of continuous speech rather than uniform utterances. These findings suggest that the automated pipelines can produce measurements that are statistically similar to those obtained through manual analysis while dramatically reducing the time and expertise required.

A prominent difference among the pipelines was in their sample rejection rates. One key comparison is the difference in rejected samples between aRFF-A1 [which rejected 31% of offset samples and 47% of onset samples] and aRFF-A2 [which showed moderately lower rejection rates at 25% of offset samples and 43% of onset samples]. The main difference in the rejected samples was a decrease in samples rejected for “too few periodic cycles” in both the offset and onset utterances for aRFF-A2. The difference in the algorithms was in fricative identification, thus this difference may be a result of misplaced fricatives in aRFF-A1. A cursory inspection of these samples revealed that over 90% of these samples were rejected because of misplaced fricatives from aRFF-A1.

The largest single source of rejections in aRFF-B (which rejected 10% of offset samples and 25% of onset samples overall) was vocal fry, particularly for onsets; samples with vocal fry are typically excluded from RFF analysis due to irregular vibratory patterns.13 This presents a general limitation to using continuous speech, as vocal fry is more common in this type of speech production, particularly in female speakers.27 All pipelines should reject similar samples due to the presence of vocal fry. However, aRFF-AP has different criteria for rejecting samples presenting with vocal fry, making it difficult to directly compare rejection rates between pipelines, as such samples could be split into rejection categories that overlap with other signal qualities warranting rejection. The considerable rejection of samples due to vocal fry stresses the importance of implementing targeted training for participants in the adequate production of not only isolated but also embedded VCV utterances. Training participants to identify vocal fry while maintaining natural speech would minimize its occurrence in samples, thereby improving algorithm performance and reducing overall rejection rates.

The average number of usable utterances for the automated pipelines ranged from 5.5 to 7.5, lower than for the manual analysis (8.8 for offset and 8.2 for onset). This range is consistent with previous work measuring RFF in continuous speech, which reported on average 5.7 usable offset instances and 4.6 usable onset instances out of nine taken from continuous speech.28 Previous work using aRFF-AP reported less than 3% rejection for offset instances and 13% rejection for onset instances,29 comparable to the manual analysis (mRFF) in this study, which rejected 3% of offset samples and 11% of onset samples. That substantially lower rejection rate affirms that aRFF-AP is better suited to its intended use, semi-automated analysis of uniform utterances, and would need further refinement for continuous-speech application.

One important limitation of the current results is that manual analysis served as the "ground truth" for the validity of the RFF calculation, and Vojtech et al.29 report that this assumption may not hold. However, no alternative to manual analysis is currently available. Moreover, the data are restricted to female voices, and comparable data for male speakers do not yet exist. For improved generalizability, the pipelines should be tested on a diverse data set of speakers with and without voice disorders.

All three pipelines performed sufficiently well in fully automated analysis of RFF, despite aRFF-A1 and aRFF-A2 not being developed intentionally for continuous speech or unaccompanied analysis. Across the proposed pipelines, however, aRFF-B is recommended for unaccompanied automated RFF analysis of continuous speech because it produced fewer rejections overall and can reduce computation time through its parallelizable architecture.

Conclusion

The analysis of large amounts of speech data is only feasible with unaccompanied analysis approaches. The present work compared three approaches for automated analysis of RFF from continuous speech. All approaches (one novel and two variations of an existing approach, aRFF-AP) were sufficiently reliable and valid compared with the gold standard of manual analysis. Although aRFF-A1 and aRFF-A2 were not developed intentionally for continuous speech or unaccompanied analysis, all three pipelines performed sufficiently well; nevertheless, aRFF-B is recommended for unaccompanied automated RFF analysis of continuous speech because of its fewer rejections overall and reduced computation time from its parallelizable architecture. A notable limitation was the high rate of sample rejection due to vocal fry, highlighting the importance of targeted participant training for adequate production of embedded VCV utterances to reduce vocal fry occurrences and improve sample retention across all algorithms while preserving naturalistic speech. These automated approaches address a significant gap in vocal effort research, where the lack of efficient measurement tools has historically limited large-scale investigations. By enabling automated RFF analysis in continuous speech, these methods will facilitate more comprehensive studies of vocal effort across speaking conditions and populations in speech-language pathology and related disciplines such as forensic voice analysis and speech biomarker analysis in psychiatry, while improving ecological validity.

Acknowledgements

We thank Ashton Bernskoetter, Taylor Hall, Katherine Johnson, Haley McCabe, Melinda Pfeiffer, Erin Tippit and Allison Walker for assistance with data collection and Allison Harris, Melinda Pfeiffer, and Erin Tippit for assistance with RFF manual analysis. We thank Matthew Page, MD, for reviewing laryngeal videostroboscopies.

Funding

Research reported in this publication was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Award Number R01DC018026.

Footnotes

Ethics Approval

The study was approved by the Institutional Review Board of the University of Missouri (IRB project number 2006447).

Conflict of Interest

The authors have no known conflicts of interest to disclose.


Data and Code Availability

The MATLAB script used for aRFF-B can be downloaded at https://github.com/mark-berardi/rff_auto_mb. Data are available upon request from the corresponding author: Mark Berardi: https://orcid.org/0000-0003-0491-0952.

References

1. Stepp CE, Hillman RE, Heaton JT. The impact of vocal hyperfunction on relative fundamental frequency during voicing offset and onset. J Speech Lang Hear Res. 2010;53(5):1220–1226. doi:10.1044/1092-4388(2010/09-0234)
2. Dahl KL, Stepp CE. Effects of cognitive stress on voice acoustics in individuals with hyperfunctional voice disorders. Am J Speech Lang Pathol. 2023;32(1):264–274. doi:10.1044/2022_AJSLP-22-00204
3. McKenna VS, Stepp CE. The relationship between acoustical and perceptual measures of vocal effort. J Acoust Soc Am. 2018;144(3):1643. doi:10.1121/1.5055234
4. Vojtech JM, Stepp CE. Effects of age and Parkinson's disease on the relationship between vocal fold abductory kinematics and relative fundamental frequency. J Voice. 2022. doi:10.1016/j.jvoice.2022.03.007
5. Eadie TL, Stepp CE. Acoustic correlate of vocal effort in spasmodic dysphonia. Ann Otol Rhinol Laryngol. 2013;122(3):169–176. doi:10.1177/000348941312200305
6. Roy N, Fetrow RA, Merrill RM, Dromey C. Exploring the clinical utility of relative fundamental frequency as an objective measure of vocal hyperfunction. J Speech Lang Hear Res. 2016;59(5):1002–1017. doi:10.1044/2016_JSLHR-S-15-0354
7. Stepp CE, Merchant GR, Heaton JT, Hillman RE. Effects of voice therapy on relative fundamental frequency during voicing offset and onset in patients with vocal hyperfunction. J Speech Lang Hear Res. 2011;54(5):1260–1266. doi:10.1044/1092-4388(2011/10-0274)
8. Ferrán S, Rodríguez-Zanetti C, Garaycochea O, et al. Relative fundamental frequency: Only for hyperfunctional voices? A pilot study. Bioengineering (Basel). 2024;11(5). doi:10.3390/bioengineering11050475
9. Lien Y-AS, Gattuccio CI, Stepp CE. Effects of phonetic context on relative fundamental frequency. J Speech Lang Hear Res. 2014;57(4):1259–1267. doi:10.1044/2014_JSLHR-S-13-0158
10. Lowell SY, Barkmeier-Kraemer JM, Hoit JD, Story BH. Respiratory and laryngeal function during spontaneous speaking in teachers with voice disorders. J Speech Lang Hear Res. 2008;51(2):333–349. doi:10.1044/1092-4388(2008/025)
11. Cheema AJ, Marks KL, Ghasemzadeh H, van Stan JH, Hillman RE, Mehta DD. Characterizing vocal hyperfunction using ecological momentary assessment of relative fundamental frequency. J Voice. 2024. doi:10.1016/j.jvoice.2024.10.025
12. Vojtech JM, Segina RK, Buckley DP, et al. Refining algorithmic estimation of relative fundamental frequency: Accounting for sample characteristics and fundamental frequency estimation method. J Acoust Soc Am. 2019;146(5):3184. doi:10.1121/1.5131025
13. Lien Y-AS, Stepp CE. Automated estimation of relative fundamental frequency. IEEE; 2013:2136–2139.
14. McKenna VS, Vojtech JM, Previtera M, Kendall CL, Carraro KE. A scoping literature review of relative fundamental frequency (RFF) in individuals with and without voice disorders. Appl Sci (Basel). 2022;12(16):8121. doi:10.3390/app12168121
15. Fairbanks G. Voice and Articulation Drillbook. 1960.
16. Kempster GB, Gerratt BR, Verdolini Abbott K, Barkmeier-Kraemer J, Hillman RE. Consensus Auditory-Perceptual Evaluation of Voice: Development of a standardized clinical protocol. Am J Speech Lang Pathol. 2009;18(2):124–132. doi:10.1044/1058-0360(2008/08-0017)
17. Dahl KL, Stepp CE. Changes in relative fundamental frequency under increased cognitive load in individuals with healthy voices. J Speech Lang Hear Res. 2021;64(4):1189–1196. doi:10.1044/2021_JSLHR-20-00134
18. Omotara G, Berardi M, Dietrich M, DeSouza GN. A pipeline consisting of pattern recognition and finite automata for recognizing VCV productions in the study of vocal hyperfunction. In: 2021 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE; 2021:1–7.
19. Ghasemzadeh H, Hillman RE, Mehta DD. Toward generalizable machine learning models in speech, language, and hearing sciences: Estimating sample size and reducing overfitting. J Speech Lang Hear Res. 2024;67(3):753–781. doi:10.1044/2023_JSLHR-23-00273
20. Nanjundeswaran C, Jacobson BH, Gartner-Schmidt J, Verdolini Abbott K. Vocal Fatigue Index (VFI): Development and validation. J Voice. 2015;29(4):433–440. doi:10.1016/j.jvoice.2014.09.012
21. Gao Y, Dietrich M, DeSouza GN. Classification of vocal fatigue using sEMG: Data imbalance, normalization, and the role of Vocal Fatigue Index scores. Appl Sci (Basel). 2021;11(10):4335. doi:10.3390/app11104335
22. Yuan J, Liberman M. Speaker identification on the SCOTUS corpus. J Acoust Soc Am. 2008;123(5):3878. doi:10.1121/1.2935783
23. Young SJ, Young S. The HTK hidden Markov model toolkit: Design and philosophy. 1993.
24. Praat: Doing Phonetics by Computer; 2013.
25. Vojtech JM, Heller Murray E. Tutorial for manual relative fundamental frequency (RFF) estimation using Praat. 2019. https://sites.bu.edu/stepplab/research/rff/
26. Heller Murray ES, Lien Y-AS, Van Stan JH, Mehta DD, Hillman RE, Pieter Noordzij J, Stepp CE. Relative fundamental frequency distinguishes between phonotraumatic and non-phonotraumatic vocal hyperfunction. J Speech Lang Hear Res. 2017;60(6):1507–1515. doi:10.1044/2016_jslhr-s-16-0262
27. Wolk L, Abdelli-Beruh NB, Slavin D. Habitual use of vocal fry in young adult female speakers. J Voice. 2012;26(3):e111–e116. doi:10.1016/j.jvoice.2011.04.007
28. Buckley DP, Vojtech JM, Stepp CE. Relative fundamental frequency in individuals with globus syndrome and muscle tension dysphagia. J Voice. 2024;38(3):612–618. doi:10.1016/j.jvoice.2021.10.013
29. Vojtech JM, Cilento DD, Luong AT, et al. Acoustic identification of the voicing boundary during intervocalic offsets and onsets based on vocal fold vibratory measures. Appl Sci (Basel). 2021;11(9). doi:10.3390/app11093816
