Abstract
Speech foundation models (SFMs) achieve state-of-the-art results in many tasks, but their performance on elderly, multilingual speech remains underexplored. In this work, we investigate SFMs' ability to analyze multilingual speech from older adults using spoken language identification as a proxy task. We propose three key qualities for foundation models serving multilingual aging populations: robustness to input duration, invariance to speaker demographics, and few-shot transferability in low-resource settings. Zero-shot evaluation indicates a noticeable performance drop for shorter inputs. We find that identification performance is consistently higher for native speakers' speech than for non-native speech across languages. Few-shot learning experiments indicate better transferability in larger models.
1. Introduction
The proportion of adults over the age of 60 in the world's population is expected to rise from 12% to 22% between 2015 and 2050.1 To improve quality of life in this aging population and enable non-invasive biomarking of neurodegenerative disorders (Konig et al., 2018; D'Arcy et al., 2008), there is a pressing need to develop robust speech processing tools capable of effectively handling speech patterns that are unique to this demographic. The difficulty is compounded in multilingual populations, such as the Indian linguistic environment, wherein a significant proportion of the population has spoken proficiency in at least two languages, requiring speech processing tools to handle multilingual scenarios alongside the associated challenges of code mixing and code switching.
The elderly population is heavily under-represented in commonly available corpora for speech processing (Ardila et al., 2019). Previous studies (Chen and Asgari, 2021; Vipperla et al., 2010) have found that conventional speech processing models trained on large-scale multilingual corpora exhibit performance degradation in automatic speech recognition (ASR) and language identification (LID) on the speech of older adults, indicating that the models struggle to adapt to the characteristics of elderly speech. Speech foundation models (SFMs) such as Whisper (Radford et al., 2023) and Massively Multilingual Speech (MMS) (Pratap et al., 2024) have shown impressive capabilities in ASR and text-to-speech synthesis (TTS) (Chu et al., 2023; Barrault et al., 2023), underscoring their promise for robust multilingual speech processing. However, prior work has demonstrated their limitations in processing speech from underrepresented demographics, such as child (Jain et al., 2023; Attia et al., 2024; Fan et al., 2022), stuttered (Huang et al., 2024; Mujtaba et al., 2024), and pathological (Violeta et al., 2022; Müller-Eberstein et al., 2024) speech.
We postulate three desirable properties of SFMs that enable robust multilingual processing in specialized demographics: robust short-term speech processing, invariance to different demographics, and few-shot learning ability. Importantly, to account for code-switching and code-mixing in multilingual contexts, models must process short-term speech segments. Speaker characteristics for a given language vary widely depending on proficiency, accent, and individual speaking style (Bai and Zhang, 2021). SFMs should account for these characteristics and provide equitable performance across demographic attributes. Additionally, the models should be able to adapt to specialized domains in low-resource settings in a few-shot manner, akin to large language models (Brown et al., 2020). In this work, we evaluate the ability of SFMs to process short-term speech from a multilingual aging population, using LID as a proxy task. Language identification (see O'Shaughnessy, 2024; Li et al., 2013 for an overview of recent methods) is a fundamental component of speech processing that facilitates efficient processing of multilingual speech.
Prior speech processing work in the aging domain has proposed various approaches to improve the robustness of models to elderly speech (Geng et al., 2022; Hu et al., 2024). However, the majority of these studies focus on a monolingual setting. This work represents one of the first attempts to study multilingual aging speech jointly, while also factoring in the impact of the language nativity of speakers. The contributions of this work are as follows:
- We show that zero-shot spoken language identification for elderly speech using Whisper and MMS exhibits degraded performance when applied to short speech segments.
- We analyze biases in the performance of SFMs across gender and language background (nativity) of the speakers. We find that performance for native speakers tends to be higher compared to non-native speakers across the multiple languages studied.
- We probe interesting properties in SFMs, such as grouping unseen languages to their closest equivalent and their capabilities in few-shot learning. Both these properties are more effective in larger models.
2. Dataset
As part of the Harmonized Diagnostic Assessment of Dementia for the Longitudinal Aging Study in India (Khobragade et al., 2024), spontaneous speech samples were collected from participants over 60 years old in their native/fluent languages. Each participant was able to converse in at least two languages. In order to elicit spontaneous speech, participants were asked to select a prompt from a list of eight well-known topics, such as a description of a festival, a short story from mythology, or a brief description of a movie plot, and to speak for about 1–2 minutes on the selected prompt. Speech recordings were manually cleaned to filter out noisy samples and interviewer speech and to ensure uniformity of language within a recording. In total, speech samples were collected in 20 Indian languages as well as English from 315 participants. Table 1 summarizes the distribution of durations and the number of samples for each language. The “Other” category refers to a combined group of languages comprising very few samples. The 12 languages listed in the table are primarily used for the LID experiments in this work. Figure 1 shows the demographic details of the participants, including age, gender, and native language background, where L1 refers to native speakers and L2 to non-native speakers of a language.
Table 1.
Distribution of languages in the collected data. As, Assamese; Bn, Bengali; En, English; Gu, Gujarati; Hi, Hindi; Mr, Marathi; Pa, Punjabi; and Ur, Urdu belong to the Indo-Aryan Family. Kn, Kannada; Ml, Malayalam; Ta, Tamil; and Te, Telugu, belong to the Dravidian family. “Other” refers to a set of languages consisting of very few samples. The total number of unique participants in the study is 315.
| Language | Hi | Kn | Te | En | Mr | Ur | Bn | Ml | Ta | As | Pa | Gu | Other | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Num. Recordings | 145 | 99 | 86 | 50 | 57 | 47 | 28 | 23 | 15 | 27 | 15 | 10 | 44 | 646 |
| # L1/L2 Recordings | 35/110 | 32/67 | 41/45 | 1/49 | 38/19 | 36/11 | 19/9 | 20/3 | 15/12 | 6/9 | 7/8 | 8/2 | 36/8 | |
| Duration (mins) | 78.1 | 49.8 | 41.4 | 35.1 | 28 | 22.8 | 21.1 | 15.2 | 11.1 | 9.2 | 9.7 | 4.3 | 34 | 359.8 |
Fig. 1.
Distribution of demographic factors in the dataset. Left: gender; middle: language nativity; right: age (in years).
3. Experiments
3.1. LID models
To evaluate the spoken language identification abilities of SFMs in a multilingual aging population, we assess the Whisper (Radford et al., 2023) and MMS (Pratap et al., 2024) backbone models. As a baseline, we report the performance of an ECAPA-TDNN (Desplanques et al., 2020) LID model from the SpeechBrain toolkit (Ravanelli et al., 2021).2
Whisper formulates ASR as a multitask problem, with voice activity detection and LID as auxiliary tasks. Whisper models are trained on 680,000 h of multilingual corpora spanning 99 languages, curated from the web and covering a diverse set of recording environments. They have shown state-of-the-art performance for ASR and speech translation in multiple languages. In our evaluation, we use all versions of Whisper, i.e., tiny (T), base (B), small (S), medium (M), and large-v3 (L).
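For reference, the snippet below sketches how such a zero-shot Whisper LID call can be made with the openai-whisper package, following its documented detect_language usage; the audio file name is a placeholder, and this is an illustrative sketch rather than the exact evaluation code used in this work.

```python
# Minimal zero-shot LID sketch with the openai-whisper package; the audio path is a placeholder.
import whisper

model = whisper.load_model("tiny")  # "tiny", "base", "small", "medium", or "large-v3"
audio = whisper.load_audio("sample_16k.wav")
audio = whisper.pad_or_trim(audio)  # pad or trim to the 30 s window Whisper expects

# large-v3 uses 128 mel bins while earlier models use 80; model.dims.n_mels covers both.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
_, probs = model.detect_language(mel)  # dict mapping language code -> probability
print(max(probs, key=probs.get))       # e.g., "hi" for Hindi
```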
MMS aims to provide robust speech processing capabilities for the majority of the world's spoken languages, with current models covering up to 4017 languages. The models are based on the wav2vec 2.0 (Baevski et al., 2020) architecture, pre-trained on an in-house curated dataset, with fine-tuned models for ASR, TTS, and LID. For all experiments in this work, we use the MMS models fine-tuned for LID provided in the original work.
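A comparable zero-shot call for the MMS-LID checkpoints can be sketched with Hugging Face Transformers, which hosts the fine-tuned LID heads under the facebook/mms-lid-* names; again, the file path is a placeholder and this is only an illustrative sketch.

```python
# Minimal MMS-LID sketch using Hugging Face Transformers; "sample_16k.wav" is a placeholder.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id = "facebook/mms-lid-126"  # also available: -256, -512, -1024, -2048, -4017
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

wav, sr = torchaudio.load("sample_16k.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(dim=-1))])  # ISO 639-3 code, e.g., "hin"
```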
ECAPA-TDNN is a fully supervised baseline for spoken LID, trained on the VoxLingua-107 (Valk and Alumäe, 2021) dataset. Unlike Whisper and MMS, which are transformer-based, this model is based on a time-delay neural network (TDNN) architecture with a combination of ResNet blocks and multi-layer feature aggregation.
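The baseline can likewise be run from the released SpeechBrain checkpoint (see footnote 2); a minimal sketch, assuming a 16 kHz audio file as input, is shown below.

```python
# Minimal sketch of the SpeechBrain ECAPA-TDNN LID baseline; "sample.wav" is a placeholder.
from speechbrain.inference.classifiers import EncoderClassifier  # speechbrain >= 1.0

language_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa", savedir="pretrained_lid")
signal = language_id.load_audio("sample.wav")
_, _, _, text_lab = language_id.classify_batch(signal)
print(text_lab)  # predicted label among the 107 VoxLingua classes
```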
3.2. Robustness to short-term speech
Whisper models are trained on speech data padded to 30-s-long segments (Radford et al., 2023), while MMS models are trained with varying lengths of speech segments (5.5–30 s) (Pratap et al., 2024), with limited exposure to short-term speech segments. Data from the 12 languages listed in Table 1 are evaluated for spoken LID in a zero-shot manner. To study short-term language identification, speech samples are split into non-overlapping segments of 5, 10, and 30 s. For each model, balanced accuracy and macro F1 scores are reported.
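A sketch of this evaluation protocol is shown below, assuming a generic predict_language(segment) helper (for instance, one of the model calls above) and 16 kHz mono waveforms; scikit-learn provides the two reported metrics. It is illustrative only, not the exact pipeline used in the study.

```python
# Sketch of zero-shot evaluation: non-overlapping segmentation plus BAC and macro F1.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

SR = 16_000  # assumed sampling rate

def split_into_segments(wave: np.ndarray, seg_seconds: int) -> list[np.ndarray]:
    """Cut a waveform into non-overlapping segments of seg_seconds, dropping the remainder."""
    seg_len = seg_seconds * SR
    n_full = len(wave) // seg_len
    return [wave[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

def evaluate(recordings, labels, predict_language, seg_seconds):
    """recordings: list of waveforms; labels: ground-truth language code per recording."""
    y_true, y_pred = [], []
    for wave, lang in zip(recordings, labels):
        for seg in split_into_segments(wave, seg_seconds):
            y_true.append(lang)
            y_pred.append(predict_language(seg))
    return (balanced_accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="macro"))
```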
3.3. Demographic invariance
We consider three demographic attributes for assessing fairness in performance: gender, age, and the nativity of the speaker. Nativity refers to whether a speaker is a native (L1) or non-native (L2) speaker of a given language. Two-sample proportion tests comparing the balanced accuracy of two groups are used to assess whether one group performs statistically significantly better than the other. Ideally, performance should be independent of speaker demographics.
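One plausible instantiation of such a test, sketched below with statsmodels, compares per-segment correctness counts between two groups; the counts are placeholders, and the exact statistic computed in the study may differ from this sketch.

```python
# Illustrative two-sample proportion (z) test on per-segment LID correctness for two groups.
from statsmodels.stats.proportion import proportions_ztest

correct = [412, 351]  # correctly identified segments for group A and group B (hypothetical)
totals = [600, 600]   # segments evaluated per group (hypothetical)

z_stat, p_value = proportions_ztest(count=correct, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # p < 0.05 would indicate a significant gap
```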
3.4. Few-shot learning
The objective of few-shot learning is to determine whether SFMs can transfer pre-trained language information to short-term language processing under resource constraints. Of the 12 languages used in the zero-shot setting, only languages with more than 20 min of data were included in fine-tuning. Ten minutes of data per language were used for training, with the rest used for evaluation. Two fine-tuning methods are compared for Whisper-{T, S, L}: linear probing and parameter-efficient fine-tuning using low-rank adaptation (LoRA, r = 1) (Hu et al., 2021). Models are trained for ten epochs with learning rates of 1 × 10⁻⁵ and 1 × 10⁻⁴ for the LoRA and linear probing experiments, respectively.
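The snippet below sketches the linear probe configuration (frozen encoder with a trainable projector and classification head) on a Whisper encoder via Hugging Face Transformers; the pooling strategy and head dimensions are assumptions for illustration rather than the exact setup used in this work.

```python
# Minimal sketch of a linear-probe LID classifier on a frozen Whisper encoder.
import torch
import torch.nn as nn
from transformers import WhisperModel

NUM_LANGS = 7  # languages in Table 1 with more than 20 min of data

class WhisperLID(nn.Module):
    def __init__(self, backbone: str = "openai/whisper-tiny", num_langs: int = NUM_LANGS):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(backbone).encoder
        hidden = self.encoder.config.d_model
        # Trainable projector + classification head on top of the (frozen) encoder.
        self.head = nn.Sequential(nn.Linear(hidden, hidden // 2), nn.ReLU(),
                                  nn.Linear(hidden // 2, num_langs))

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: log-mel spectrogram padded to 30 s, shape (batch, n_mels, 3000)
        states = self.encoder(input_features).last_hidden_state  # (batch, frames, d_model)
        return self.head(states.mean(dim=1))                     # mean-pool over time

model = WhisperLID()
for p in model.encoder.parameters():  # linear probe: freeze the encoder
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)  # 1e-4 for linear probing
```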
4. Results and discussion
4.1. Evaluating short-term speech capabilities of SFMs
Multilingual short-term speech processing capabilities of SFMs are evaluated through zero-shot LID prediction on non-overlapping input speech segments of 5, 10, and 30 s. Ideally, the models should show minimal degradation in performance when the input speech duration is reduced from 30 to 5 s. Table 2 shows the LID results on the 12 languages for the Whisper and MMS variants and the ECAPA-TDNN model. We notice that LID performance for Whisper and MMS drops significantly as the duration of the speech input decreases. This could be owing to both MMS and Whisper being trained on longer speech inputs, leading to limited short-term speech understanding capabilities. Notably, the ECAPA-TDNN model, which is fully supervised, shows no signs of performance degradation, indicating robustness to input speech length.
Table 2.
Zero-shot spoken language identification results across speech input durations 5, 10, 30 s. BAC and F1 scores are reported across 12 languages for two SFMs, i.e., Whisper and MMS. Best metrics per row are shown in bold.
| Input (Metric) | ECAPA-TDNN | Whisper-T | Whisper-B | Whisper-S | Whisper-M | Whisper-L | MMS-126 | MMS-256 | MMS-512 | MMS-1024 | MMS-2048 | MMS-4017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5s (F1) | 48.6 | 19.1 | 24.6 | 33.5 | 39.5 | 42.1 | 42.7 | 39.5 | 40.1 | 38.5 | 39.5 | 40.6 |
| 5s (BAC) | 60.0 | 21.2 | 26.5 | 36.6 | 41.1 | 43.9 | 51.6 | 48.1 | 47.2 | 45.4 | 48.3 | 46.8 |
| 10s (F1) | 48.5 | 26.9 | 34.6 | 47.8 | 49.1 | 53.3 | 50.9 | 48.1 | 48.1 | 46.6 | 47.8 | 49.3 |
| 10s (BAC) | 59.8 | 29.9 | 37.2 | 51.6 | 51.8 | 57.0 | 62.7 | 59.8 | 58.8 | 56.4 | 59.3 | 58.6 |
| 30s (F1) | 48.3 | 35.9 | 50.9 | 56.5 | 54.3 | 58.6 | 56.6 | 54.9 | 54.8 | 49.8 | 52.4 | 55.0 |
| 30s (BAC) | 59.4 | 41.1 | 51.8 | 60.7 | 57.9 | 63.5 | 67.7 | 66.9 | 65.7 | 59.6 | 64.4 | 64.2 |
As the parameter count of the models increases (Whisper-T: 39 M to Whisper-L: 1.5 B), an expected trend of performance gain is observed for all three input speech lengths. As the model size increases from Whisper-S to Whisper-L, a far larger relative performance improvement is seen for the 5 s segment duration [∼20% balanced accuracy (BAC)] than for 30 s (∼5% BAC), indicating that models with a larger parameter count provide better short-term language recognition. For MMS, as the number of language classes in training increases, i.e., from 126 to 4017, a decrease in performance is expected owing to the latent spaces of similar languages being closer. While this trend is observed from MMS-126 to MMS-1024, there is a surprising increase in performance from MMS-1024 to MMS-4017.
Whisper models can recognize 99 languages, which is far larger than the evaluation set of this work. For the speech samples where SFMs make errors in language identification, it would be preferable for the predicted language to be within the same language family as the ground truth; i.e., predicting a Hindi sample as Marathi (an Indian language) reflects better encoding of language information in the models than predicting it as Spanish (a non-Indian language). Figure 2 shows the proportion of times each Whisper model predicts a non-Indian language. We observe two phenomena. First, as the model size increases, the proportion of non-Indian language predictions decreases, indicating better language information encoding in larger models. Second, as the duration of the speech signal decreases, there is a consistent increase in the non-Indian language prediction ratio, indicating increased difficulty when inferring from shorter durations. Additionally, the ECAPA-TDNN model has a lower proportion of non-Indian language predictions independent of input speech length, indicating better short-term language representations.
Fig. 2.
Proportion of speech inputs predicted as non-Indian languages during zero-shot evaluation for Whisper models.
4.2. Language grouping for unseen languages in SFMs
Another desirable property of SFMs for multilingual speech processing is the ability to categorize unseen languages to their closest equivalent. Here, we use Odia (not present in Whisper training) as an example language. Bengali is the closest language to Odia (both being Eastern Indo-Aryan languages3) among the languages the Whisper models are trained on. Figure 3 shows the proportion of Odia samples predicted as Bengali for Whisper-T and Whisper-L. For longer input speech segments, a high proportion of samples (>60%) are predicted as Bengali by both models. For shorter input speech segments (5 s), however, Whisper-T shows a significant degradation, with 78% of samples predicted as other languages such as English, Hindi, and Spanish, indicating that the language grouping ability of smaller models decreases as the input speech duration decreases. This drop is absent in the larger model (Whisper-L), suggesting better latent representations for multilingual speech. This may enable few-shot learning for unseen low-resource languages if they share linguistic and acoustic similarities with a language the SFM was trained on.
Fig. 3.
Proportion of unseen language (Odia) samples Whisper-{L,T} predict as Bengali, Nepali, or Others for {5, 10, 30 s} inputs, where Others includes all non-{Bengali, Nepali} predictions.
4.3. Do speaker demographics affect LID performance?
Table 3 shows the overall performance of the different models compared across nativity, gender, and age, suggesting that speech from native, relatively younger (60–70 years), and male speakers is, on average, recognized better than speech from non-native, older (>70 years), and female speakers. Owing to the nature of the study cohort, we are not able to compare against younger, non-elderly demographics; nonetheless, it is notable that we observe degradation in LID performance for older speakers (>70 years) compared to relatively younger speakers (60–70 years). To better understand the trends for these attributes across languages, we consider pairwise comparisons of gender and nativity together, i.e., L1-male, L1-female, L2-male, and L2-female, for the top three languages by duration: Hindi, Telugu, and Kannada. Figure 4 shows the zero-shot performance of Whisper-T, Whisper-S, and Whisper-L for these languages, separated by nativity and gender of the speaker. Across the three models, we see that L1 speakers, whether male or female, achieve higher performance than their L2 counterparts. This performance gap could be owing to a higher proportion of atypical speech patterns in L2 speakers (van Maastricht et al., 2021), especially in the elderly population, resulting in increased difficulty during inference. When comparing male and female speakers, we do not see a clear trend of one gender performing better across languages. Owing to the small sample size of the dataset, we do not find statistically significant differences between demographic groups under the two-sample proportion test, despite consistent trends being observed for native vs non-native LID performance.
Table 3.
LID performance (BAC) aggregated over nativity, gender, and age for 5 s input segments. L1, native speakers; L2, non-native speakers; M, male; F, female. Relatively younger speakers are 60–70 years old and relatively older speakers are >70 years old. English was excluded due to a lack of L1 speakers.
| Group | ECAPA-TDNN | Whisper-T | Whisper-L | MMS-126 |
|---|---|---|---|---|
| L1 (native) | 67.1 | 21.9 | 54.5 | 59.0 |
| L2 (non-native) | 55.3 | 19.6 | 43.1 | 45.8 |
| Male | 66.1 | 21.9 | 45.6 | 53.0 |
| Female | 52.6 | 21.3 | 40.9 | 47.1 |
| 60–70 years | 66.2 | 23.5 | 47.3 | 54.2 |
| >70 years | 56.2 | 20.5 | 43.1 | 49.7 |
Fig. 4.
LID balanced accuracy comparing the performance of L1 and L2 speakers for 5 s input speech segments (left: Whisper-T; middle: Whisper-S; right: Whisper-L). Across all languages and models, L2 speakers show a performance drop on the LID task relative to L1 speakers.
4.4. Are SFMs few-shot learners?
In this section, we investigate whether SFMs can adapt to short-term LID in an aging population given limited speech samples per language. The objective of this investigation is to determine whether the encoders of SFMs can distinguish languages under very low resource constraints. Performance after fine-tuning the Whisper encoders may not exceed that of the zero-shot setting: zero-shot predictions are obtained from the decoder, whereas our fine-tuning setting uses a projection head on top of the Whisper encoder alone. Whisper-{T, S, L} are fine-tuned on the languages from Table 1 with more than 20 min of data, with 10 min per language (5 s input duration) used for training and the rest for evaluation. Results for the linear probe (frozen encoder with trainable projector and classification head) experiments in Table 4 indicate only marginally above-chance performance (majority-vote BAC = 14.2%) for Whisper-T, whereas Whisper-{S, L} demonstrate better few-shot transferability to short-term speech owing to better language information in their pre-trained embeddings. Additionally, adapting intermediate layers of the models through LoRA provides performance gains over the linear probe setting. However, as explained previously, models fine-tuned in this setting do not outperform the zero-shot models owing to the fine-tuning datasets being extremely small (∼10 min per language).
Table 4.
Results for fine-tuning Whisper-{T, S, L} using 10 min of training data per language for 5 s input speech. BAC is reported. Note: # params refers to the number of trainable parameters.
| Configuration | Whisper-T | Whisper-S | Whisper-L |
|---|---|---|---|
| Zero-shot BAC | 45.5 | 51.1 | 54.1 |
| Linear probe # params | 101 K | 203 K | 320 K |
| Linear probe BAC | 20.8 | 27.9 | 28.9 |
| LoRA (r = 1) # params | 29 K | 168 K | 741 K |
| LoRA (r = 1) BAC | 29.9 | 39.3 | 37.5 |
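As a companion to the probe sketch in Sec. 3.4, the LoRA (r = 1) configuration in the last rows of Table 4 could be attached to the same encoder wrapper using the PEFT library; the target module names and the choice to keep the classification head trainable are assumptions of this sketch, not necessarily the authors' exact setup.

```python
# Rank-1 LoRA adapters on the encoder attention projections; WhisperLID is the wrapper
# defined in the Sec. 3.4 sketch, and this configuration is trained with lr = 1e-5.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=1,                                  # rank-1 adaptation, matching the paper's setting
    lora_alpha=1,
    target_modules=["q_proj", "v_proj"],  # Whisper encoder self-attention projections (assumed)
    modules_to_save=["head"],             # keep the LID head trainable alongside the adapters
)
lora_model = get_peft_model(WhisperLID(), lora_cfg)
lora_model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```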
5. Conclusion and future work
In this work, we explore the ability of SFMs to recognize languages from a multilingual aging population. While we identify some limitations of SFMs with respect to short-term speech processing in a multilingual aging population through language identification, the specific characteristics of the aging population, such as dialect variations, impaired voice quality, and cognitive decline, that could lead to the observed phenomena remain an open question. Additionally, the amount of data collected for some of the languages is extremely small (<10 min), preventing any statistical analysis of results in those cases. This was primarily owing to difficulties in eliciting spontaneous speech from the elderly population. As future work, we aim to collect a larger speech corpus that could help investigate the factors leading to the discrepancies in performance for foundation models.
Acknowledgements
We gratefully acknowledge support from the National Institutes of Health (NIH) under Grant Nos. R01 AG051125, U01 AG064948, RF1 AG055273, and R01AG080473.
Footnotes
1. For more information, see https://who.int/news-room/fact-sheets/detail/ageing-and-health.
2. For more information, see https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa.
3. For more information, see https://en.wikipedia.org/wiki/Eastern_Indo-Aryan_languages.
Author Declarations
Conflict of Interest
The authors have no conflicts to disclose.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request. The data are not publicly available due to privacy restrictions.
References
- 1. Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G. (2019). “Common voice: A massively-multilingual speech corpus,” arXiv:1912.06670.
- 2. Attia, A. A., Liu, J., Ai, W., Demszky, D., and Espy-Wilson, C. (2024). “Kid-whisper: Towards bridging the performance gap in automatic speech recognition for children vs. adults,” in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 74–80.
- 3. Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Adv. Neural Inf. Process. Syst. 33, 12449–12460.
- 4. Bai, Z., and Zhang, X.-L. (2021). “Speaker recognition based on deep learning: An overview,” Neural Networks 140, 65–99. 10.1016/j.neunet.2021.03.004
- 5. Barrault, L., Chung, Y.-A., Meglioli, M. C., Dale, D., Dong, N., Duppenthaler, M., Duquenne, P.-A., Ellis, B., Elsahar, H., Haaheim, J., Hoffman, J., Hwang, M.-J., Inaguma, H., Klaiber, I., Kulikov, I., Li, P., Licht, D., Maillard, J., Mavlyutov, R., Rakotoarison, A., Sadagopan, K. R., Ramakrishnan, A., Tran, T., Wenzek, G., Yang, Y., Ye, E., Evtimov, I., Fernandez, P., Gao, C., Hansanti, P., Kalbassi, E., Kallet, A., Kozhevnikov, A., Mejia Gonzalez, G., San Roman, R., Touret, C., Wong, C., Wood, C., Yu, B., Andrews, P., Balioglu, C., Chen, P.-J., Costa-jussà, M. R., Elbayad, M., Gong, H., Guzmán, F., Heffernan, K., Jain, S., Kao, J., Lee, A., Ma, X., Mourachko, A., Peloquin, B., Pino, J., Popuri, S., Ropers, C., Saleem, S., Schwenk, H., Sun, A., Tomasello, P., Wang, C., Wang, J., Wang, S., and Williamson, M. (2023). “Seamless: Multilingual expressive and streaming speech translation,” arXiv:2312.05187.
- 6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). “Language models are few-shot learners,” Adv. Neural Inf. Process. Syst. 33, 1877–1901.
- 7. Chen, L., and Asgari, M. (2021). “Refining automatic speech recognition system for older adults,” in Proceedings of the ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7003–7007.
- 8. Chu, Y., Xu, J., Zhou, X., Yang, Q., Zhang, S., Yan, Z., Zhou, C., and Zhou, J. (2023). “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv:2311.07919.
- 9. D'Arcy, S., Rapcan, V., Penard, N., Morris, M. E., Robertson, I. H., and Reilly, R. B. (2008). “Speech as a means of monitoring cognitive function of elderly speakers,” in Proceedings of Interspeech 2008, pp. 2230–2233.
- 10. Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proceedings of Interspeech 2020, pp. 3830–3834.
- 11. Fan, R., Zhu, Y., Wang, J., and Alwan, A. (2022). “Towards better domain adaptation for self-supervised models: A case study of child ASR,” IEEE J. Sel. Top. Signal Process. 16(6), 1242–1252. 10.1109/JSTSP.2022.3200910
- 12. Geng, M., Xie, X., Ye, Z., Wang, T., Li, G., Hu, S., Liu, X., and Meng, H. (2022). “Speaker adaptation using spectro-temporal deep features for dysarthric and elderly speech recognition,” IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2597–2611. 10.1109/TASLP.2022.3195113
- 13. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). “LoRA: Low-rank adaptation of large language models,” arXiv:2106.09685.
- 14. Hu, S., Xie, X., Geng, M., Jin, Z., Deng, J., Li, G., Wang, Y., Cui, M., Wang, T., Meng, H., and Liu, X. (2024). “Self-supervised ASR models and features for dysarthric and elderly speech recognition,” IEEE/ACM Trans. Audio Speech Lang. Process. 32, 3561–3575. 10.1109/TASLP.2024.3422839
- 15. Huang, S., Zhang, D., Deng, J., and Zheng, R. (2024). “Enhanced ASR for stuttering speech: Combining adversarial and signal-based data augmentation,” in Proceedings of the 2024 IEEE Spoken Language Technology Workshop (SLT), pp. 393–400.
- 16. Jain, R., Barcovschi, A., Yiwere, M., Corcoran, P., and Cucu, H. (2023). “Adaptation of Whisper models to child speech recognition,” in Proceedings of Interspeech 2023, pp. 5242–5246.
- 17. Khobragade, P. Y., Petrosyan, S., Dey, S., Dey, A., and Lee, J. (2024). “Design and methodology of the harmonized diagnostic assessment of dementia for the longitudinal aging study in India: Wave 2,” J. Am. Geriatr. Soc. 73(3), 685–696.
- 18. Konig, A., Satt, A., Sorin, A., Hoory, R., Derreumaux, A., David, R., and Robert, P. H. (2018). “Use of speech analyses within a mobile application for the assessment of cognitive impairment in elderly people,” Curr. Alzheimer Res. 15(2), 120–129. 10.2174/1567205014666170829111942
- 19. Li, H., Ma, B., and Lee, K. A. (2013). “Spoken language recognition: From fundamentals to practice,” Proc. IEEE 101(5), 1136–1159. 10.1109/JPROC.2012.2237151
- 20. Mujtaba, D., Mahapatra, N. R., Arney, M., Yaruss, J. S., Herring, C., and Bin, J. (2024). “Inclusive ASR for disfluent speech: Cascaded large-scale self-supervised learning with targeted fine-tuning and data augmentation,” arXiv:2406.10177.
- 21. Müller-Eberstein, M., Yee, D., Yang, K., Mantena, G. V., and Lea, C. (2024). “Hypernetworks for personalizing ASR to atypical speech,” Trans. Assoc. Comput. Linguist. 12, 1182–1196. 10.1162/tacl_a_00696
- 22. O'Shaughnessy, D. (2024). “Spoken language identification: An overview of past and present research trends,” Speech Commun. 167, 103167.
- 23. Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., and Auli, M. (2024). “Scaling speech technology to 1,000+ languages,” J. Mach. Learn. Res. 25(97), 1–52.
- 24. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). “Robust speech recognition via large-scale weak supervision,” in Proceedings of the International Conference on Machine Learning, pp. 28492–28518.
- 25. Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J., Yeh, S.-L., Fu, S.-W., Liao, C.-F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, Y., De Mori, R., and Bengio, Y. (2021). “SpeechBrain: A general-purpose speech toolkit,” arXiv:2106.04624.
- 26. Valk, J., and Alumäe, T. (2021). “VoxLingua107: A dataset for spoken language recognition,” in Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 652–658.
- 27. van Maastricht, L., Zee, T., Krahmer, E., and Swerts, M. (2021). “The interplay of prosodic cues in the L2: How intonation, rhythm, and speech rate in speech by Spanish learners of Dutch contribute to L1 Dutch perceptions of accentedness and comprehensibility,” Speech Commun. 133, 81–90. 10.1016/j.specom.2020.04.003
- 28. Violeta, L. P., Huang, W. C., and Toda, T. (2022). “Investigating self-supervised pretraining frameworks for pathological speech recognition,” in Proceedings of Interspeech 2022, pp. 41–45.
- 29. Vipperla, R., Renals, S., and Frankel, J. (2010). “Ageing voices: The effect of changes in voice parameters on ASR performance,” EURASIP J. Audio, Speech, Music Process. 2010, 1–10. 10.1155/2010/525783