Abstract
Current de-facto dysfluency modeling methods [1, 2] utilize template matching algorithms which are not generalizable to out-of-domain real-world dysfluencies across languages, and are not scalable with increasing amounts of training data. To handle these problems, we propose Stutter-Solver: an end-to-end framework that detects dysfluency with accurate type and time transcription, inspired by the YOLO [3] object detection algorithm. Stutter-Solver can handle co-dysfluencies and is a natural multi-lingual dysfluency detector. To leverage scalability and boost performance, we also introduce three novel dysfluency corpora: VCTK-Pro, VCTK-Art, and AISHELL3-Pro, simulating natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation through articulatory-encodec and TTS-based methods. Our approach achieves state-of-the-art performance on all available dysfluency corpora. Code and datasets are open-sourced at https://github.com/eureka235/Stutter-Solver.
Keywords: dysfluency, co-dysfluency, end-to-end, multi-lingual, simulation, aphasia, clinical
1. INTRODUCTION
Speech dysfluency modeling is the core module in speech therapy. The U.S. speech therapy market is projected to reach USD 6.93 billion by 2030 [4]. Technically, speech dysfluency modeling is a speech recognition problem, a field dominated by large-scale models [5, 6, 7, 8]. However, these large ASR models struggle with dysfluent speech because ASR is effectively a dysfluency-removal process. For a long time, researchers have mainly treated dysfluency modeling as a classification problem. Early methods relied on hand-crafted features [9, 10, 11, 12, 13]. More recently, end-to-end classification methods have been developed [14, 15, 16, 17, 18]. However, two major problems remain. First, dysfluency depends on the text, which previous methods have ignored. Second, simple classification is too coarse to be deployed in real speech therapy systems.
UDM [1] proposed 2D-alignment, the alignment between the reference text and the phoneme-level forced alignment. A template matching algorithm (with each dysfluency type as a template) was then applied for dysfluency detection. The subsequent work, H-UDM [2], proposed a recursive version of UDM [1] that updates word boundary segments together with alignment prediction. Nevertheless, these methods still have limitations. First, UDM [1] and H-UDM [2] are essentially feature engineering approaches, which may not adequately handle real-world dysfluencies that do not conform to predefined templates. Second, developing templates for each language is impractical, as dysfluency templates are inherently language-dependent. Third, template matching algorithms do not use training data, so they cannot scale with dysfluency data even when it is available.
To handle the aforementioned limitations, we approach dysfluency modeling from a simple and new perspective: dysfluency modeling can be regarded as an object detection problem in the 1D domain. We therefore formulate it as a detection task, inspired by YOLO [3]. We propose Stutter-Solver, which takes dysfluent speech and the reference ground-truth text as input and directly predicts dysfluency types and time boundaries in an end-to-end manner. Note that [19] uses a similar idea; Stutter-Solver, however, focuses on co-dysfluencies, multi-linguality, articulatory simulation, and co-dysfluency TTS simulation.
Stutter-Solver requires high-quality annotated dysfluency data (with precisely annotated types and time boundaries). We therefore propose a new articulatory-based dysfluency simulation method [20] and compare it experimentally with TTS-based methods. We developed three synthetic dysfluency datasets: VCTK-Pro and AISHELL3-Pro, built with VITS [21] as the TTS-based method, and VCTK-Art, built with Articulatory Encodec [20] as a vocal tract articulation simulation tool. Both VCTK-Pro and VCTK-Art build upon the VCTK corpus [22], whereas AISHELL3-Pro builds upon the AISHELL3 corpus [23]. These datasets include repetition, missing, block, replacement, and prolongation at the phoneme and word levels for English, and at the character level for Mandarin. As such, Stutter-Solver is naturally a multi-lingual co-dysfluency detector with no hand-crafted templates involved. As part of the speech therapy process, we also obtained speech from 38 English-speaking and 8 Mandarin-speaking nfvPPA subjects [24] through clinical collaborations. The proposed Stutter-Solver achieved state-of-the-art accuracy, bound loss, and time F1 score on our new benchmarks (VCTK-Art, VCTK-Pro, AISHELL3-Pro), on public corpora, and on nfvPPA speech.
2. ARTICULATORY-BASED SIMULATION
Previous research on dysfluent speech simulation [1, 14] has focused on direct manipulation of the waveform, which results in poor naturalness, as evidenced in Table 2. To address this limitation, we perform simulation in two orthogonal spaces: articulatory space and textual space. This section details articulatory-based simulation, while Section 3 elaborates on the textual space approach (TTS-based simulation).
Table 2: MOS for simulated datasets
| Dysfluency Type | VCTK++ | VCTK-Art | VCTK-Pro | AISHELL3-Pro |
|---|---|---|---|---|
| Repetition | 1.40 ± 0.70 | 2.61 ± 1.05 | 3.33 ± 0.86 | 3.88 ± 0.73 |
| Missing | N/A | 3.44 ± 1.23 | 3.89 ± 1.05 | 3.37 ± 1.06 |
| Block | 2.80 ± 0.63 | 3.35 ± 1.13 | 3.22 ± 1.09 | 2.96 ± 1.03 |
| Replace | N/A | 3.48 ± 1.42 | 2.62 ± 1.21 | 3.13 ± 0.74 |
| Prolongation | 1.20 ± 0.79 | 2.55 ± 0.53 | 3.00 ± 1.00 | 2.64 ± 1.12 |
| Overall | 1.80 ± 0.74 | 3.08 ± 1.12 | 3.21 ± 0.97 | 3.19 ± 0.95 |
For the articulatory-based method, we simulate dysfluency by directly editing the articulatory control space, using offline articulatory inversion and synthesis models (Articulatory Encodec [20]). The Articulatory Encodec is composed of an acoustic-to-articulatory inversion (AAI) model and an articulatory vocoder. [20] shows that the Articulatory Encodec can be successfully applied to arbitrary accents and speaker identities with high performance. The pipeline is detailed below.
2.1. Method Pipeline
We first run MFA to align raw VCTK speech with its ground-truth text, obtaining a 50 Hz phoneme-level forced alignment that matches the EMA features from the AAI module. Various types of dysfluency are then introduced by editing the EMA features:
- Repetition: The target phoneme segment is duplicated 2–4 times.
- Replace: We sample a random phoneme from the current EMA feature to replace the target phoneme.
- Block: A silence segment of 10–15 units is inserted after the target phoneme, with the silence frames sampled from the beginning of the current EMA feature.
- Missing: The target phoneme is removed.
- Prolongation: We interpolate within the target phoneme, extending its duration to 4–6 times its original length.
For repetition, the target phoneme is the first phoneme of a randomly selected word; for prolongation, it is a randomly chosen vowel. For the other dysfluency types, the target phonemes are selected arbitrarily, without further restrictions. To ensure smooth auditory perception, we insert a 2-unit interpolated buffer before and after each modification. All the interpolation operations mentioned above use bilinear interpolation. Besides the phoneme-level modifications above, we also implemented word-level repetition and missing, where the target word is modified instead of the target phoneme, with all other aspects remaining identical. The whole pipeline is depicted in Fig. 2.
Fig. 2: Pipeline of articulatory-based simulation.
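The frame-level edits described in Sec. 2.1 can be sketched as list operations on an EMA feature sequence. This is a minimal illustration, not the released implementation: the function names and the `(start, end)` frame-index convention are hypothetical, "duplicated 2–4 times" is read here as 2–4 extra copies, silence is assumed to occupy the leading frames of the utterance, and plain linear interpolation along time stands in for the interpolation used in the paper. The replace operation and the 2-unit buffer frames are omitted.

```python
import random

def repeat_segment(frames, start, end, copies):
    """Repetition: insert `copies` extra copies of frames[start:end] right after it."""
    return frames[:end] + frames[start:end] * copies + frames[end:]

def insert_block(frames, end, n_silence):
    """Block: insert n_silence frames after the target, copied from the utterance start
    (assumes the utterance begins with at least n_silence frames of silence)."""
    return frames[:end] + frames[:n_silence] + frames[end:]

def drop_segment(frames, start, end):
    """Missing: remove the target segment."""
    return frames[:start] + frames[end:]

def prolong_segment(frames, start, end, factor):
    """Prolongation: stretch frames[start:end] to `factor` times its length
    by linear interpolation along the time axis."""
    seg = frames[start:end]
    n_out = len(seg) * factor
    stretched = []
    for k in range(n_out):
        # Position of output frame k on the original segment's time axis.
        pos = k * (len(seg) - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, len(seg) - 1)
        w = pos - lo
        stretched.append([(1 - w) * a + w * b for a, b in zip(seg[lo], seg[hi])])
    return frames[:start] + stretched + frames[end:]

def simulate(frames, start, end, kind, rng=random):
    """Apply one dysfluency type with the parameter ranges quoted in the text."""
    if kind == "repetition":
        return repeat_segment(frames, start, end, rng.randint(2, 4))
    if kind == "block":
        return insert_block(frames, end, rng.randint(10, 15))
    if kind == "missing":
        return drop_segment(frames, start, end)
    if kind == "prolongation":
        return prolong_segment(frames, start, end, rng.randint(4, 6))
    raise ValueError(f"unsupported dysfluency type: {kind}")
```

Because every edit changes the sequence length by a known amount, the start and end frames of the dysfluent region can be recorded at the same time as the edit, which is what makes the simulated data usable as time-annotated training labels.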
Note that only the English version of the Articulatory Encodec model is available at this time, which limits our articulatory simulation to English. However, we explore multi-lingual simulation with the TTS-based method, detailed in the next section.
3. MULTI-LINGUAL TTS-BASED SIMULATION
3.1. Method pipeline
The pipeline of TTS-based simulation can be divided into the following steps: 1) Dysfluency injection: for VCTK-Pro, we convert VCTK [22] text into IPA sequences via the VITS phonemizer, and for AISHELL3-Pro, we convert Mandarin text into pinyin sequences. We then add different types of dysfluencies at the phoneme/word (English) and pinyin (Chinese) level according to the TTS rules (Sec. 3.2). 2) VITS [21] inference: we take the dysfluency-injected IPA/pinyin sequences as inputs, run the VITS inference procedure, and obtain the dysfluent speech. 3) Annotation: we retrieve phoneme alignments from the VITS duration model and annotate the dysfluent region with its dysfluency type.
3.2. Co-Dysfluency TTS rules
For VCTK-Pro, we incorporate phoneme and word-level dysfluency; for AISHELL3-Pro, we introduce character-level dysfluency. Dysfluencies are simulated via TTS rules [19], with examples provided in Fig. 3.
Fig. 3: TTS rules for VCTK-Pro and AISHELL3-Pro.
In VCTK-Pro, we introduce co-dysfluency by adding multiple dysfluencies to a single utterance. Co-dysfluency is categorized into single-type and multi-type. For single-type, we insert 2–3 instances of the same dysfluency type (covering every type mentioned above) at various positions within an utterance. For multi-type, we incorporate 5 combinations of dysfluencies: (rep-missing), (rep-block), (missing-block), (replace-block), and (prolong-block), with 2 random positions chosen for each combination within the utterance. Note that, due to ethical concerns, de-identification techniques [25] might also be involved in the process.
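A multi-type co-dysfluency injection over a token sequence might be sketched as follows. The five combinations are taken from the text; everything else is an assumption made for illustration: the token-level rules in `apply_rule` (including the `<pause>` token and the `:` lengthening mark) are hypothetical stand-ins for the actual TTS rules of Fig. 3, and assigning the first type of a combination to the earlier position is an arbitrary choice.

```python
import random

# The five multi-type combinations listed in the text.
COMBOS = [("rep", "missing"), ("rep", "block"), ("missing", "block"),
          ("replace", "block"), ("prolong", "block")]

def apply_rule(tokens, pos, kind, rng):
    """Hypothetical token-level rule: returns what replaces tokens[pos]."""
    tok = tokens[pos]
    if kind == "rep":
        return [tok, tok]
    if kind == "missing":
        return []
    if kind == "block":
        return [tok, "<pause>"]
    if kind == "prolong":
        return [tok + ":"]
    if kind == "replace":
        others = sorted(set(tokens) - {tok})
        return [rng.choice(others)] if others else [tok]
    raise ValueError(f"unsupported dysfluency type: {kind}")

def inject_multi_type(tokens, combo, rng):
    """Inject the two dysfluency types of `combo` at two distinct random positions.

    Returns the edited sequence and (type, original position) annotations.
    """
    positions = sorted(rng.sample(range(len(tokens)), 2))
    edited = list(tokens)
    labels = []
    # Edit from right to left so earlier indices stay valid after each edit.
    for pos, kind in sorted(zip(positions, combo), reverse=True):
        edited[pos:pos + 1] = apply_rule(tokens, pos, kind, rng)
        labels.append((kind, pos))
    return edited, sorted(labels, key=lambda item: item[1])
```

For example, `inject_multi_type(phonemes, rng.choice(COMBOS), rng)` with `rng = random.Random(seed)` yields one reproducible co-dysfluent sequence together with its type annotations.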
The statistics of the three simulated datasets are detailed in Table 1.
Table 1: Statistics of simulated datasets (hours)
| Dysfluency | VCTK-Pro | VCTK-Art | AISHELL3-Pro |
|---|---|---|---|
| Repetition | 258.33 | 111.34 | 102.37 |
| Missing | 180.89 | 107.43 | 100.22 |
| Block | 132.41 | 56.95 | 104.91 |
| Replace | 128.18 | 56.85 | 100.84 |
| Prolongation | 87.50 | 59.62 | 103.03 |
| Co-dysfluency | 337.84 | - | - |
| Total | 1125.1 | 392.19 | 511.37 |
4. DYSFLUENCY DETECTION AS OBJECT DETECTION
Accurate dysfluency detection requires handling text dependencies, since stutters are not necessarily monotonic. In this work, we adopt the soft speech-text alignment from VITS [21], which is one of the SOTA TTS models. Given this speech-text alignment as input, our model requires two main components: an optimal spatial and temporal downsampling method, and an extraction mechanism that accurately attends to the relevant dysfluent signal. Region-wise dysfluency detection can be viewed as a 1D extension of the 2D object detection problem in computer vision. Drawing inspiration from YOLO [3], we design a detector that takes the soft speech-text alignment and produces a fixed-size 64 × 8 (temporal dim × output dim) output matrix. At each timestep, 8 values are predicted: the dysfluency start and end bounds, a confidence score, and C (= 5) class predictions. The detector, which uses a region-wise prediction scheme, consists of spatial pattern collector blocks followed by a temporal analysis unit. The entire paradigm is shown in Fig. 1, and the corresponding modules are detailed below.
Fig. 1: We utilize the pretrained VITS speech and text encoders to process the spectrogram and the reference text respectively, generating the soft speech-text alignments that are passed into the detector. The output matrix contains the existence confidence score and 5 type confidence scores (start and end bounds are left out of the diagram). The higher the brightness, the higher the score, indicating the existence and type of dysfluency. a) shows the serial structure of our detector, with a spatial encoder followed by a temporal encoder. b) is a diagram of a spatial encoder block: grouped convolutions are important for extracting local spatial features without completely collapsing information across the text dimension.
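To make the 64 × 8 output format concrete, the sketch below decodes such a matrix into a list of detected regions. It is an illustration under stated assumptions rather than the released post-processing: the column order (start, end, confidence, then five class scores), the class order in `TYPES`, and the 0.5 threshold are all hypothetical, and the conversion to milliseconds assumes that bounds are normalized by a 1024-frame padded length with 20 ms per frame, as described in Sec. 4.3 and Sec. 5.3.

```python
TYPES = ["repetition", "block", "missing", "replace", "prolongation"]  # assumed order
PADDED_FRAMES = 1024   # padded spectrogram length used to normalize bounds
FRAME_MS = 20          # duration of one frame in milliseconds

def decode(output, threshold=0.5):
    """Turn a (64 x 8) output matrix into (type, start_ms, end_ms, confidence) tuples."""
    regions = []
    for row in output:
        start, end, confidence = row[0], row[1], row[2]
        class_scores = row[3:8]
        if confidence < threshold:
            continue  # no dysfluency predicted for this region
        best = max(range(len(class_scores)), key=class_scores.__getitem__)
        regions.append((TYPES[best],
                        start * PADDED_FRAMES * FRAME_MS,
                        end * PADDED_FRAMES * FRAME_MS,
                        confidence))
    return regions

# One confident region among 64, all others empty.
output = [[0.0] * 8 for _ in range(64)]
output[10] = [0.25, 0.5, 0.9, 0.1, 0.7, 0.1, 0.05, 0.05]
print(decode(output))  # [('block', 5120.0, 10240.0, 0.9)]
```

Because several rows can pass the threshold in the same utterance, this output format accommodates more than one dysfluency per utterance without any change to the decoding step.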
4.1. Soft speech-text alignments
We obtain a monotonic attention matrix from VITS [21] that represents how each input phoneme aligns with the target speech; one axis is the text dimension and the other is the speech duration. For training, we use the soft alignments and apply a softmax operation across the text dimension, computing the maximum attention value for each time step. To calculate the soft alignments, we use the original pre-trained text and speech encoders. The former is a Transformer encoder [26] with relative positional embeddings, and the latter uses the non-causal residual blocks from WaveGlow [27]. This soft-alignment attention matrix is then passed into the detection head for training and inference.
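The softmax step can be illustrated on a small matrix. In this sketch the layout `attn[i][t]` (text token `i`, speech frame `t`) and the function names are assumptions chosen for readability; the point is only that normalization runs over the text axis, so every speech frame holds a distribution over text tokens, from which the per-frame maximum is read.

```python
import math

def softmax_over_text(attn):
    """Normalize each speech frame (column) into a distribution over text tokens."""
    n_text, n_time = len(attn), len(attn[0])
    soft = [[0.0] * n_time for _ in range(n_text)]
    for t in range(n_time):
        column = [attn[i][t] for i in range(n_text)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for i in range(n_text):
            soft[i][t] = exps[i] / total
    return soft

def max_per_frame(soft):
    """Largest attention value over the text axis at each speech frame."""
    return [max(row[t] for row in soft) for t in range(len(soft[0]))]
```

A frame whose maximum is close to 1 is confidently aligned to a single text token, while a flat column indicates speech that does not clearly correspond to any part of the reference text.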
4.2. Spatial-Temporal Encoders
We adopt the same spatial-temporal encoders as [19]. They consist of a region-wise spatial encoder and a temporal encoder.
4.2.1. Spatial Encoder
Learnable spatial pattern collector blocks are used to preserve local spatial features. The intuition is as follows. Traditional speech recognition models take speech features such as mel spectrograms as input and apply the de facto encoder [28]. However, the soft alignments are spatially different from speech features, so separable convolutions (a depthwise convolution followed by a pointwise convolution) are ineffective for this input representation. A modified convolution paradigm was therefore proposed in [19], in which a depthwise convolution is followed by a grouped convolution instead of a pointwise convolution. This has been shown experimentally to preserve region-wise information, as visualized in Fig. 1 (b).
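The difference between a pointwise and a grouped convolution can be seen from which input channels feed each output channel. The sketch below uses the standard grouped-convolution convention (channels split into contiguous groups); the channel counts and group number are illustrative and are not the values used in the paper.

```python
def conv1d_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights (ignoring bias) of a 1D convolution with `groups` groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel

def visible_inputs(out_channel, c_in, c_out, groups):
    """Input channels that contribute to one output channel of a grouped convolution."""
    group = out_channel // (c_out // groups)
    per_group = c_in // groups
    return list(range(group * per_group, (group + 1) * per_group))

# Pointwise convolution (groups=1): every output mixes all 8 input channels.
print(conv1d_weight_count(8, 8, kernel=1, groups=1))   # 64
print(visible_inputs(5, c_in=8, c_out=8, groups=1))    # [0, 1, 2, 3, 4, 5, 6, 7]

# Grouped convolution (groups=4): each output only sees its own 2-channel region.
print(conv1d_weight_count(8, 8, kernel=1, groups=4))   # 16
print(visible_inputs(5, c_in=8, c_out=8, groups=4))    # [4, 5]
```

With a pointwise convolution every output mixes the whole channel axis, whereas grouping keeps each output tied to a local block of channels, which is consistent with the stated goal of not collapsing information across the text dimension.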
4.2.2. Temporal Encoder
Since the task is to predict time-aware dysfluencies, technically it is a region-wise aggregation, which is a 1D sequential timing problem. To achieve this, the transformer encoder [26] is simply applied to handle both global and local timing alignments. We employed transformer-base in this setting.
4.3. Training Loss
The speech utterance is split into segments of a fixed number of steps. For each segment, we predict three things: (1) the dysfluency confidence score, i.e. whether, and how confidently, a dysfluency exists in this segment; (2) the start and end boundaries of the dysfluency; and (3) the dysfluency type: whether the dysfluency is a block, repetition, replacement, insertion, or missing word. The bound values are normalized to the range 0–1, using the fixed padded length as the maximum bound value. The loss function combines these three terms over all regions and classes, weighted by balancing factors, together with an indicator of whether a dysfluency appears in each segment.
5. EXPERIMENTS
5.1. Datasets
VCTK [22] includes recordings from 109 native English speakers. Each speaker reads out about 400 sentences from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive. This corpus encompasses about 48 hours of accented speech. It is used for simulating VCTK-Pro and VCTK-Art.
AISHELL-3 [23] is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that includes 218 native Mandarin Chinese speakers with roughly 85 hours of emotion-neutral recordings. It is used for simulating AISHELL3-Pro.
LibriStutter [14] is a synthesized dataset which contains artificially stuttered speech and stutter classification labels for 5 stutter types. It was generated using 20 hours of audio selected from the ‘dev-clean-100’ section of [29].
UCLASS [30] contains recordings from 128 children and adults who stutter. Only 25 files have been annotated, and they are not annotated for the block class; we used only those files and did not use the block class for subsequent datasets.
SEP-28K, curated by [31], contains 28,177 clips extracted from publicly available podcasts. These clips are labeled with five event types: block, prolongation, sound repetition, word repetition, and interjection. Clips labeled as “unsure” were excluded from the dataset.
Aphasia Speech is collected from our clinical collaborators. Our dysfluent data comprises 46 participants (38 English speakers and 8 Mandarin Chinese speakers) diagnosed with Primary Progressive Aphasia (PPA), which is larger than the data used in [1, 2], which has only 3 English speakers.
5.2. Training
We trained the detector for 30 epochs using a 90/10 train/test split, applied separately to each of the three simulated datasets. We used a batch size of 64 and the Adam optimizer, configured with beta values of 0.9 and 0.999 and a learning rate of 3e-4. We did not use dropout or weight decay in training. Training on VCTK-Art, VCTK-Pro (without co-dysfluency), and AISHELL3-Pro requires a total of 39, 41, and 36 hours respectively on an RTX A6000.
5.3. Metrics
Phoneme Error Rate (PER) measures how many errors (inserted, deleted, and substituted phonemes) a predicted phoneme sequence contains compared to the actual phoneme sequence. It is calculated by dividing the number of phoneme errors by the total number of phonemes.
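As a worked example of this definition, PER can be computed with a standard edit-distance recursion. The sketch assumes that the number of errors is the minimum number of insertions, deletions, and substitutions (Levenshtein distance) and that the denominator is the length of the reference sequence; the function names and the example phonemes are illustrative.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER = phoneme errors / number of reference phonemes."""
    return edit_distance(ref, hyp) / len(ref)

ref = ["DH", "AH", "K", "AE", "T"]
hyp = ["DH", "AH", "K", "AH", "T", "S"]  # one substitution, one insertion
print(phoneme_error_rate(ref, hyp))       # 0.4
```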
Accuracy (Acc.) refers to the correctness of predictions regarding types of dysfluency within regions that exhibit some form of dysfluency.
Bound loss is calculated as the mean squared error between the predicted and actual boundaries of dysfluent regions within a 1024-length padded spectrogram, which is then converted to a time scale using a 20 ms sampling period. For co-dysfluency analyses, the bound loss is averaged across all identified dysfluent regions.
Time F1 [1] measures the accuracy of boundary predictions by assessing the overlap between predicted and actual dysfluent region bounds. A sample is classified as a True Positive if any intersection occurs between these bounds.
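The overlap criterion can be sketched at the region level as follows. The true-positive rule (any intersection) comes from the definition above; how false positives and false negatives are counted is an assumption made here for illustration: a predicted region that overlaps no reference region counts as a false positive, and a reference region that no prediction overlaps counts as a false negative.

```python
def overlaps(a, b):
    """True if intervals a = (start, end) and b = (start, end) intersect."""
    return max(a[0], b[0]) < min(a[1], b[1])

def time_f1(predicted, reference):
    """F1 over dysfluent regions, using 'any intersection' as the match criterion."""
    tp = sum(any(overlaps(p, r) for r in reference) for p in predicted)
    fp = len(predicted) - tp
    fn = sum(not any(overlaps(p, r) for p in predicted) for r in reference)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

predicted = [(0.0, 1.0), (5.0, 6.0)]   # seconds
reference = [(0.5, 1.5), (3.0, 4.0)]
print(time_f1(predicted, reference))    # 0.5
```

Under this criterion a method that outputs no time boundaries has no true positives and therefore a Time F1 of 0.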
5.4. Evaluation of dysfluency simulation
5.4.1. MOS tests
To evaluate the rationality and naturalness of the three datasets we constructed, we collected Mean Opinion Score (MOS, 1–5) ratings from 11 people. The results are displayed in Table 2. Our three simulated datasets were perceived to be far more natural than the VCTK++ [1] baseline corpus. Notably, VCTK-Pro was rated as closely mimicking human speech.
5.4.2. Dysfluency intelligibility
To verify the intelligibility of the simulated datasets, we use a phoneme recognition model [32] to evaluate the raw VCTK (/) and the various types of dysfluent speech from VCTK-Pro and VCTK-Art. The results in Table 3 show generally low PER, indicating good intelligibility and usability despite higher PERs than raw VCTK. Comparatively, VCTK-Pro performs better overall, while VCTK-Art excels particularly in repetition and block. AISHELL3-Pro was not evaluated due to the lack of an available high-quality pinyin-level Chinese speech recognition model.
Table 3: Phoneme transcription evaluation, PER (%, ↓)
| Type | / | Repetition | Missing | Block | Replace | Prolongation |
|---|---|---|---|---|---|---|
| VCTK-Art | 6.243 | 8.328 | 8.250 | 10.118 | 9.665 | 9.893 |
| VCTK-Pro | 6.243 | 8.869 | 7.600 | 11.974 | 8.004 | 6.346 |
5.5. Dysfluency detection
To assess the performance of the trained detector, we conduct evaluations on the three simulated datasets, as well as on the PPA data. The results, which include type-specific detection accuracy and bound loss metrics, are detailed in Table 4 for the simulated datasets and in Table 5 for the PPA data. Additionally, we compared our results with previous works by validating them on UCLASS, LibriStutter, and SEP-28K, where we computed type-specific accuracy and Time F1, as shown in Table 6.
Table 4: Accuracy (Acc) and Bound loss (BL) of the five dysfluency types for models trained on VCTK-Art, VCTK-Pro, and AISHELL3-Pro.
| Methods | Trainable parameters | Dataset | Rep Acc.% | Rep BL | Block Acc.% | Block BL | Miss Acc.% | Miss BL | Replace Acc.% | Replace BL | Prolong Acc.% | Prolong BL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H-UDM [2] | 92M | VCTK-Art | 84.29 | 29 ms | 97.59 | 24 ms | 29.11 | 27 ms | - | - | - | - |
| Stutter-Solver (VCTK-Art) | 33M | VCTK-Art | 87.55 | 21 ms | 99.64 | 15 ms | 91.17 | 12 ms | 66.81 | 15 ms | 79.16 | 17 ms |
| H-UDM [2] | 92M | VCTK-Pro | 74.66 | 68 ms | 88.44 | 85 ms | 15.00 | 100 ms | - | - | - | - |
| Stutter-Solver (VCTK-Pro) | 33M | VCTK-Pro | 98.78 | 27 ms | 98.71 | 78 ms | 70.00 | 8 ms | 73.33 | 10 ms | 93.74 | 12 ms |
| Stutter-Solver (AISHELL3-Pro) | 33M | AISHELL3-Pro | 93.33 | 17 ms | 99.98 | 52 ms | 95.00 | 2 ms | 95.16 | 13 ms | 96.55 | 16 ms |
Table 5: Dysfluency evaluation on Aphasia speech.
| Methods | Ave. Acc. (%, ↑) | Best Acc. (%, ↑) | Ave. BL (ms, ↓) |
|---|---|---|---|
| H-UDM [2] | 41.8 | 70.22 | 52 |
| Stutter-Solver (VCTK-Art) | 52.82 | 93.47 (Repetition) | 41 |
| Stutter-Solver (VCTK-Pro) | 54.19 | 92.54 (Block) | 21 |
| Stutter-Solver (AISHELL3-Pro) | 72.37 | 94.85 (Block) | 13 |
Table 6: Type-specific accuracy (Acc) and Time F1 score
| Methods | Dataset | Rep Acc. (%, ↑) | Prolong Acc. (%, ↑) | Block Acc. (%, ↑) | Time F1 (↑) |
|---|---|---|---|---|---|
| Kourkounakis et al. [14] | UCLASS | 84.46 | 94.89 | - | 0 |
| Jouaiti et al. [16] | UCLASS | 89.60 | 99.40 | - | 0 |
| Lian et al. [2] | UCLASS | 75.18 | - | 50.09 | 0.700 |
| Stutter-Solver (VCTK-Art) | UCLASS | 82.56 | 84.83 | 64.42 | 0.806 |
| Stutter-Solver (VCTK-Pro) | UCLASS | 92.00 | 91.43 | 56.00 | 0.893 |
| Kourkounakis et al. [14] | LibriStutter | 82.24 | 92.14 | - | 0 |
| Lian et al. [2] | LibriStutter | 85.00 | - | - | 0.660 |
| Stutter-Solver (VCTK-Art) | LibriStutter | 89.04 | 62.58 | - | 0.686 |
| Stutter-Solver (VCTK-Pro) | LibriStutter | 89.71 | 67.74 | - | 0.697 |
| Jouaiti et al. [16] | SEP-28K | 78.70 | 93.00 | - | 0 |
| Lian et al. [2] | SEP-28K | 70.99 | - | 66.44 | 0.699 |
| Stutter-Solver (VCTK-Art) | SEP-28K | 78.31 | 74.99 | 68.02 | 0.786 |
| Stutter-Solver (VCTK-Pro) | SEP-28K | 82.01 | 89.19 | 68.09 | 0.813 |
In Table 4, we used H-UDM [2] as the baseline. Both versions of Stutter-Solver (VCTK-Art and VCTK-Pro) surpassed H-UDM across all metrics. Notably, Stutter-Solver (VCTK-Pro) showed stronger results for English. Additionally, the model trained on AISHELL3-Pro performed even better, likely due to the unique pronunciation traits of Chinese and its noticeable character-level dysfluency. In Table 6, we present our results on publicly available datasets (UCLASS, LibriStutter, and SEP-28K). Since the original benchmarks use private test sets, direct accuracy comparisons may not be completely fair. We instead emphasize time-aware detection, reporting the Time F1 score for each dataset. All baselines except H-UDM scored 0. Our proposed methods consistently outperformed H-UDM in these evaluations. In Table 5, both versions of Stutter-Solver outperformed H-UDM, and the Chinese model performed best on Chinese PPA. However, the average accuracy remained low, underscoring the challenge of accurately capturing the real distribution of dysfluency.
5.6. Co-dysfluency
In Section 3.2, we incorporated co-dysfluency into VCTK-Pro. We trained Stutter-Solver on single-type, multi-type, and mixed-type (single and multi) co-dysfluency respectively, and measured average accuracy and bound loss on the corresponding simulated data. The results, shown in Table 7, demonstrate that our detector’s performance on co-dysfluency matches its capability in simpler scenarios with only one dysfluency per utterance. This indicates that our detector handles co-dysfluency effectively. It is worth noting that this fundamental property is missing in previous work.
Table 7: Evaluation of co-dysfluency
| Methods | Co-dysfluency | Ave. Acc. (%, ↑) | Ave. BL (ms, ↓) |
|---|---|---|---|
| Stutter-Solver (/) | / | 91.24 | 29 |
| Stutter-Solver (Single-type) | Single-type | 90.22 | 26 |
| Stutter-Solver (Multi-type) | Multi-type | 89.59 | 15 |
| Stutter-Solver (Mix-type) | Mix-type | 90.08 | 24 |
5.7. Multi-lingual
In addition to training the detector separately on single languages, we trained it simultaneously on two languages to evaluate its performance in a multi-lingual scenario. We randomly sampled 300 hours of data from both VCTK-Pro and AISHELL3-Pro for training. The results, presented in Table 8, show that multi-lingual training slightly improved detection performance for English but significantly reduced it for Chinese compared with training separately on a single language. This indicates that multi-lingual training has varying effects on detection accuracy depending on the language. It is important to note that our method does not require additional language-specific dysfluency templates, in contrast to the previous state-of-the-art work [2].
Table 8: Evaluation of multi-lingual training
| Methods | Dataset | Ave. Acc. (%, ↑) | Ave. BL (ms, ↓) |
|---|---|---|---|
| Stutter-Solver (VCTK-Pro) | VCTK-Pro | 91.24 | 29 |
| Stutter-Solver (Multi-lingual) | VCTK-Pro | 93.98 | 21 |
| Stutter-Solver (AISHELL3-Pro) | AISHELL3-Pro | 96.00 | 20 |
| Stutter-Solver (Multi-lingual) | AISHELL3-Pro | 86.88 | 43 |
6. CONCLUSIONS AND LIMITATIONS
We propose Stutter-Solver, which detects speech dysfluencies in an end-to-end manner. Stutter-Solver is able to handle co-dysfluencies within an utterance and is a natural multi-lingual dysfluency detector. We also introduce three annotated simulated dysfluency corpora, with which Stutter-Solver achieves state-of-the-art performance on several dysfluency corpora. However, limitations exist. First, the performance on real nfvPPA speech is far worse than that on simulated speech. Future work will focus on reducing the gap between simulated and real dysfluency distributions. Second, the proposed simulated corpora are not large-scale, and we have not reached the limit of scaling. Future work will push the scaling effort further when more data and resources are available. Third, it is worth exploring simulation in gestural space [33, 34] or rtMRI space [35] instead of articulatory EMA space for finer-grained control. It is also worth exploring both speaker-dependent and speaker-independent dysfluencies via disentangled analysis and synthesis [36, 37, 38, 39, 40], which serves as the foundation for behavioral dysfluency studies.
7. ACKNOWLEDGEMENT
Thanks for support from UC Noyce Initiative, Society of Hellman Fellows, NIH/NIDCD and the Schwab Innovation fund.
8. REFERENCES
- [1].Lian Jiachen, Feng Carly, Farooqi Naasir, Li Steve, Kashyap Anshul, Jun Cho Cheol, Wu Peter, Netzorg Robbie, Li Tingle, and Anumanchipalli Gopala Krishna, “Unconstrained dysfluency modeling for dysfluent speech transcription and detection,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Lian Jiachen and Anumanchipalli Gopala, “Towards hierarchical spoken language dysfluency modeling,” Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024. [Google Scholar]
- [3].Redmon Joseph, Divvala Santosh, Girshick Ross, and Farhadi Ali, “You only look once: Unified, real-time object detection,” 2016.
- [4].“ppa market,” https://www.fortunebusinessinsights.com/u-s-speech-therapy-market-105574.
- [5].Radford Alec, Kim Jong Wook, Xu Tao, Brockman Greg, McLeavey Christine, and Sutskever Ilya, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518. [Google Scholar]
- [6].Zhang Yu, Han Wei, Qin James, Wang Yongqiang, Bapna Ankur, Chen Zhehuai, Chen Nanxin, Li Bo, Axelrod Vera, Wang Gary, et al. , “Google usm: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023. [Google Scholar]
- [7].Pratap Vineel, Tjandra Andros, Shi Bowen, Tomasello Paden, Babu Arun, Kundu Sayani, Elkahky Ali, Ni Zhaoheng, Vyas Apoorv, Fazel-Zarandi Maryam, et al. , “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023. [Google Scholar]
- [8].Lian Jiachen, Baevski Alexei, Hsu Wei-Ning, and Auli Michael, “Av-data2vec: Self-supervised learning of audio-visual speech representations with contextualized target representations,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023. [Google Scholar]
- [9].Ai Ooi Chia, Hariharan M, Yaacob Sazali, and Chee Lim Sin, “Classification of speech dysfluencies with mfcc and lpcc features,” Expert Systems with Applications, vol. 39, no. 2, pp. 2157–2165, 2012. [DOI] [PubMed] [Google Scholar]
- [10].Chee Lim Sin, Ai Ooi Chia, Hariharan M, and Yaacob Sazali, “Automatic detection of prolongations and repetitions using lpcc,” in 2009 International Conference for Technical Postgraduates (TECHPOS), 2009, pp. 1–4. [Google Scholar]
- [11].Esmaili Iman, Dabanloo Nader Jafarnia, and Vali Mansour, “Automatic classification of speech dysfluencies in continuous speech based on similarity measures and morphological image processing tools,” Biomedical Signal Processing and Control, vol. 23, pp. 104–114, 2016. [Google Scholar]
- [12].Jouaiti Melanie and Dautenhahn Kerstin, “Dysfluency classification in speech using a biological sound perception model,” in 2022 9th International Conference on Soft Computing & Machine Intelligence (ISCMI), 2022, pp. 173–177. [Google Scholar]
- [13].Kourkounakis Tedd, Hajavi Amirhossein, and Etemad Ali, “Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory,” in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6089–6093. [Google Scholar]
- [14].Kourkounakis Tedd, Hajavi Amirhossein, and Etemad Ali, “Fluentnet: End-to-end detection of stuttered speech disfluencies with deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2986–2999, 2021. [Google Scholar]
- [15].Alharbi Sadeen, Hasan Madina, Simons Anthony JH, Brumfitt Shelagh, and Green Phil, “Sequence labeling to detect stuttering events in read speech,” Computer Speech & Language, vol. 62, pp. 101052, 2020. [Google Scholar]
- [16].Jouaiti Melanie and Dautenhahn Kerstin, “Dysfluency classification in stuttered speech using deep learning for real-time applications,” in ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6482–6486. [Google Scholar]
- [17].Howell Peter and Sackin Stevie, “A utomatic recognition of repetitions and prolongations in stuttered speech,” in Proceedings of the first World Congress on fluency disorders. University Press Nijmegen Nijmegen, The Netherlands, 1995, vol. 2, pp. 372–374. [Google Scholar]
- [18].Mohapatra Payal, Islam Bashima, Islam Md Tamzeed, Jiao Ruochen, and Zhu Qi, “Efficient stuttering event detection using siamese networks,” in ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. [Google Scholar]
- [19].Zhou Xuanru, Kashyap Anshul, Li Steve, Sharma Ayati, Morin Brittany, Baquirin David, Vonk Jet, Ezzes Zoe, Miller Zachary, Tempini Maria Luisa Gorno, et al. , “Yolo-stutter: End-to-end region-wise speech dysfluency detection,” Interspeech, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Cho Cheol Jun, Wu Peter, Prabhune Tejas S., Agarwal Dhruv, and Anumanchipalli Gopala K., “Articulatory encodec: Vocal tract kinematics as a codec for speech,” arXiv preprint arXiv:2406.12998, 2024. [Google Scholar]
- [21].Kim Jaehyeon, Kong Jungil, and Son Juhee, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540. [Google Scholar]
- [22].Yamagishi Junichi, Veaux Christophe, MacDonald Kirsten, et al. , “Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019. [Google Scholar]
- [23].Shi Yao, Bu Hui, Xu Xin, Zhang Shaoji, and Li Ming, “Aishell-3: A multi-speaker mandarin tts corpus and the baselines,” 2015.
- [24].Gorno-Tempini Maria Luisa, Hillis Argye E, Weintraub Sandra, Kertesz Andrew, Mendez Mario, Cappa Stefano F, Ogar Jennifer M, Rohrer Jonathan D, Black Steven, Boeve Bradley F, et al. , “Classification of primary progressive aphasia and its variants,” Neurology, vol. 76, no. 11, pp. 1006–1014, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Gao Yang, Lian Jiachen, Raj Bhiksha, and Singh Rita, “Detection and evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 544–551.
- [26].Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia, “Attention is all you need,” in Advances in Neural Information Processing Systems. 2017, vol. 30, Curran Associates, Inc.
- [27].Prenger Ryan J., Valle Rafael, and Catanzaro Bryan, “Waveglow: A flow-based generative network for speech synthesis,” ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621, 2018.
- [28].Gulati Anmol, Qin James, Chiu Chung-Cheng, Parmar Niki, Zhang Yu, Yu Jiahui, Han Wei, Wang Shibo, Zhang Zhengdong, Wu Yonghui, and Pang Ruoming, “Conformer: Convolution-augmented transformer for speech recognition,” 2020.
- [29].Panayotov Vassil, Chen Guoguo, Povey Daniel, and Khudanpur Sanjeev, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210.
- [30].Howell Peter, Davis Steve, and Bartrip Jon, “The uclass archive of stuttered speech,” Journal of Speech Language and Hearing Research, vol. 52, pp. 556, 2009.
- [31].Lea Colin, Mitra Vikramjit, Joshi Aparna, Kajarekar Sachin, and Bigham Jeffrey P, “Sep-28k: A dataset for stuttering event detection from podcasts with people who stutter,” in ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6798–6802.
- [32].Li Xinjian, Dalmia Siddharth, Li Juncheng, Lee Matthew, Littell Patrick, Yao Jiali, Anastasopoulos Antonios, Mortensen David R, Neubig Graham, Black Alan W, and Florian Metze, “Universal phone recognition with a multilingual allophone system,” in ICASSP 2020. IEEE, 2020, pp. 8249–8253.
- [33].Lian Jiachen, Black Alan W, Goldstein Louis, and Anumanchipalli Gopala K., “Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition,” in Proc. Interspeech 2022, 2022, pp. 4686–4690.
- [34].Lian Jiachen, Black Alan W, Lu Yijing, Goldstein Louis, Watanabe Shinji, and Anumanchipalli Gopala K, “Articulatory representation learning via joint factor analysis and neural matrix factorization,” in ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [35].Wu Peter, Li Tingle, Lu Yijing, Zhang Yubin, Lian Jiachen, Black Alan W, Goldstein Louis, Watanabe Shinji, and Anumanchipalli Gopala K., “Deep Speech Synthesis from MRI-Based Articulatory Representations,” in Proc. INTERSPEECH 2023, 2023, pp. 5132–5136.
- [36].Lian Jiachen, Zhang Chunlei, Anumanchipalli Gopala K., and Yu Dong, “Unsupervised tts acoustic modeling for tts with conditional disentangled sequential vae,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2548–2557, 2023.
- [37].Qian Kaizhi, Zhang Yang, Gao Heting, Ni Junrui, Lai Cheng-I, Cox David, Hasegawa-Johnson Mark, and Chang Shiyu, “Contentvec: An improved self-supervised speech representation by disentangling speakers,” in ICML, 2022.
- [38].Lian Jiachen, Zhang Chunlei, and Yu Dong, “Robust disentangled variational speech representation learning for zero-shot voice conversion,” in ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6572–6576.
- [39].Lian Jiachen, Zhang Chunlei, Anumanchipalli Gopala Krishna, and Yu Dong, “Towards Improved Zero-shot Voice Conversion with Conditional DSVAE,” in Proc. Interspeech 2022, 2022, pp. 2598–2602.
- [40].Choi Hyeong-Seok, Yang Jinhyeok, Lee Juheon, and Kim Hyeongju, “Nansy++: Unified voice synthesis with neural analysis and synthesis,” ICLR, 2022.
