2025 Aug 22;15:30938. doi: 10.1038/s41598-025-13902-7

Table 1.

Summary of prior studies, including datasets, methodology employed, limitations, and results.

Ref.	Datasets	Methodology	Limitations	Results
⁷	• Dementia Bank Pitt Corpus: 292 participants with 552 audio recordings.	• Deep learning: Bidirectional Encoder Representations from Transformers (BERT) model.	• Limited applicability to broader populations because the study depends on particular datasets. • The absence of external dataset assessment and real-time validation raises concerns about practical applicability. • Possible bias and overfitting may affect model performance.	• Accuracy of 89.8%.
¹¹	• Pitt Corpus, part of the Dementia Bank dataset: around 488 selected sessions. • PROMPT Database: around 496 session recordings.	• Gated Convolutional Neural Networks (GCNN).	• Limited generalizability because evaluation was confined to the Pitt Corpus and the PROMPT Database. • The modest accuracy and inconsistent results across varying speech data lengths raise concerns about robustness and practical application.	• Precision rate of 73.1%. • This rate increases to 80.8% when the entirety of the patient’s speech data is used.
¹⁴	• The dataset consists of 54 AD and 54 non-AD patients; the test set includes 24 AD and 24 non-AD patients.	• BERT + Gated Self-Attention, LSTM, Ensemble Technique.	• Limited applicability due to specific dataset characteristics. • Potential difficulty generalizing to diverse populations and the complexity of interpreting the combined multimodal approach hinder broader adoption of the proposed methods.	• Accuracy of 86.25% and F1-score of 85.4%.
¹⁷	• The dataset consisted of 3245 pairs of audio recordings and the Dementia Bank Database.	• Stacked Deep Dense Neural Network.	• Findings may be specific to the dataset, external validation is difficult, and real-world applicability remains to be tested. • Because the study relies only on transcript data, it may miss broader contextual elements that influence the accuracy of Alzheimer’s prediction.	• Accuracy of 93.31%.
²⁰	• Pitt Corpus: 307 people with Alzheimer’s disease and 243 healthy controls.	• Deep learning, KNN, RF, SVM, ANN.	• Limited dataset diversity, potential feature extraction bias, limited generalizability due to a single database, reliance on the effectiveness of the proposed methodology, and no real-time applicability.	• SVM accuracy of around 77%.
²⁴	• Dementia Bank Database: 194 dementia patients and 99 control patients.	• BERT model. • Deep learning-based multimodal approach.	• Limited generalizability due to a single dataset, possible bias in the data collection process, unclear external validation, and little information on practical use. • The emphasis on explainability may compromise model complexity.	• Accuracy of 90.36%.
²⁸	• Dementia Bank Database: 218 audio recordings from dementia participants and 224 from healthy control (HC) subjects, totaling 442 audio recordings.	• Machine learning, CNN, ANN, RNN. • PRCNN (Parallel Recurrent CNN).	• Limited generalizability because the Pitt Corpus was the only dataset used, potential bias in the data representation, no external validation, and a focus on individual speech variables that may overlook holistic linguistic characteristics.	• Accuracy of 85%.
Our Paper	• The study uses the Pitt Corpus, a large collection of multimodal exchanges from the Dementia Bank database. • The dataset offers a varied and well-chosen sample comprising 104 controls, 208 dementia patients, and 85 people without a diagnosis.	• Multimodal Siamese networks.	• Although our multimodal Siamese networks identified dementia from speech with 99% accuracy, our study shares some limitations with previous research: the dataset was skewed, and the findings might not apply in different contexts. • The accuracy achieved nonetheless distinguishes this work from others and points to possible therapeutic implications for early diagnosis and intervention.	• The model demonstrates the efficacy of multimodal Siamese networks for dementia detection from speech in women, with an accuracy of 99%. • This accuracy raises the possibility of valuable applications for early diagnosis and intervention in clinical settings.
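
The dataset skew noted in the last row can be made concrete using the counts reported there. The following is a minimal illustrative sketch, not the authors' code. It assumes the classification task is binary over the 208 dementia patients and 104 controls, since the table does not state how the 85 undiagnosed participants were used. The function names `class_shares` and `majority_baseline` are hypothetical.

```python
# Participant counts as reported in the "Our Paper" row of Table 1.
counts = {"control": 104, "dementia": 208, "undiagnosed": 85}


def class_shares(counts):
    """Return each class's fraction of the total sample."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(counts):
    """Accuracy of a trivial classifier that always predicts the largest class."""
    return max(counts.values()) / sum(counts.values())


# Assumption (not stated in the table): the binary task uses only the
# diagnosed groups, i.e. dementia vs. control.
binary = {label: counts[label] for label in ("control", "dementia")}

shares = class_shares(counts)
baseline = majority_baseline(binary)

for label, share in shares.items():
    print(f"{label}: {share:.1%} of all participants")
print(f"majority-class baseline (binary task): {baseline:.1%}")
```

Under that assumption, a classifier that always predicts "dementia" would already reach about two thirds accuracy (208 of 312), which is why accuracy figures on a skewed sample are best read alongside the class distribution.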