Scientific Reports. 2025 Aug 22;15:30938. doi: 10.1038/s41598-025-13902-7

Multimodal Siamese networks for dementia detection from speech in women

Amel Ksibi 1, Ahlem Walha 2,#, Mohammed Zakariah 3,#, Manel Ayadi 1,#, Tagrid Alshalali 1,#, Nouf Abdullah Almujally 1,#
PMCID: PMC12373850  PMID: 40847098

Abstract

This study addresses the critical need for early and precise detection of dementia, a debilitating cognitive illness that disproportionately affects women. As dementia becomes more common, global healthcare systems face enormous hurdles, and because current diagnostic techniques are frequently invasive, expensive, and imprecise, non-invasive and effective alternatives are needed. To address this issue, our work presents a method for identifying dementia in women from speech using multimodal Siamese networks. In contrast to earlier models, our approach uses both transcript and audio data, exploiting the complementary information present in the two modalities. The main motivation for this study is improving the accuracy and reliability of dementia detection, particularly in the early stages, when intervention can be more successful. The data used in this study cover 104 people in the control group, 208 people with a dementia diagnosis, and 85 whose diagnosis is uncertain. There are 238 control files and 298 dementia files in the audio dataset, and 243 control files and 306 dementia files in the transcript dataset. This extensive dataset makes it possible to evaluate our proposed model with confidence. The core methodology relies on multimodal Siamese networks, a technique that captures relationships between multimodal data. Our model achieves an accuracy of 99% on the Dementia Bank Database, a considerable improvement over earlier approaches, and the assessment metrics, including an Area Under the Curve (AUC) of 0.99, support the efficacy of our methodology. By enabling non-intrusive dementia detection for early diagnosis and intervention, this work aims to improve the quality of life of at-risk individuals.

Keywords: Dementia detection, Speech in women, Multimodal Siamese network, Audio-transcript data

Subject terms: Computer science, Neurological disorders

Introduction

Dementia poses a serious public-health problem because it affects many people with slow mental decline1–3. It affects the brain in many ways, causing memory loss, problems with problem-solving, and difficulties in communicating4,5. As the world population ages, more people are being diagnosed with dementia, so finding new and effective ways to detect it earlier is essential6,7.

If dementia is identified early, interventions may be able to slow its progress and improve how the affected person feels8–10. Even so, determining whether someone has dementia requires combining clinical assessments, brain scans, and cognitive exams11,12. Standard diagnostic methods can be hard to access, are often expensive, and many people are reluctant to undergo invasive testing13–15.

Recently, experts have turned to non-invasive techniques to spot diseases early, and speech analysis has emerged as a promising method16,17. Because speech is a basic form of human communication, it reveals important information about the mind and may indicate when cognitive decline is beginning18–20. Being non-invasive, speech analysis draws interest for monitoring and screening large numbers of people who may develop dementia21,22.

When researching dementia, it is essential to pay close attention to women. Earlier studies suggest that women and men experience dementia differently, which requires unique techniques to identify and treat it23–25. Given the variety of issues women with dementia deal with, including hormonal changes and societal factors, specialized investigation is needed to understand dementia in this group26–28.

This paper aims to improve our knowledge of dementia diagnosis by describing multimodal Siamese networks as a new approach. Combining different methods, such as speech analysis, provides a detailed overview that helps diagnostic instruments work more accurately29,30. Siamese networks, a class of artificial neural network, can perform various tasks, for example, pattern recognition and evaluating similarities31–33. Working with these networks in a multimodal setup, we hope to expose little-noticed relationships in speech content that could indicate early dementia in women.

The significance of this project lies in two areas: providing a powerful tool for dementia diagnosis and addressing the special needs of women34,35. We plan to discover any gender differences in mental decline and develop detection methods that focus more on women. This study’s outcomes might help healthcare professionals, researchers, and policymakers by better explaining how to spot dementia and by improving treatment methods.

We aim to provide new ways to help find dementia at an early stage. With this, solutions will be found that allow for better intervention and a better quality of life for people, primarily women, with this disease. We aim to contribute much to the global effort against dementia by blending innovative technology and neuroscience with an emphasis on gender.

Figure 1 represents a multimodal Siamese network model that takes speech as input, performs feature extraction through twin networks with shared weights, and predicts the risk of dementia with gender-specific analysis for the cognitive health evaluation of women.

Fig. 1.

Fig. 1

A Multimodal Siamese Network Framework for Detecting Dementia in Women via Speech Analysis.
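The defining property of the twin-network architecture in Fig. 1 is that both branches share a single set of weights, so similar inputs land close together in the embedding space. The following is a minimal illustrative sketch in numpy, not the authors' implementation; the input size, single linear layer, and Euclidean distance are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared embedding weights: the defining property of a Siamese network is
# that BOTH branches use the same parameters (hypothetical 16-dim input
# features -> 8-dim embedding).
W = rng.standard_normal((16, 8))

def embed(x):
    """One branch of the twin network: a single shared linear layer + tanh."""
    return np.tanh(x @ W)

def pair_distance(x1, x2):
    """Euclidean distance between the two embeddings; a small distance
    suggests the pair comes from the same class (e.g. both 'dementia')."""
    return float(np.linalg.norm(embed(x1) - embed(x2)))

a = rng.standard_normal(16)
b = rng.standard_normal(16)
print(pair_distance(a, a))  # 0.0 -- identical inputs embed identically
print(pair_distance(a, b))  # some positive distance
```

During training, same-class pairs would be pushed toward small distances and cross-class pairs toward large ones (e.g. with a contrastive loss); at test time, the distance to labelled reference samples supports classification.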

The specific objectives of this research paper are as follows:

  • i.

    A multimodal Siamese network that integrates voice and additional data is presented to detect dementia in women more accurately by spotting subtle, gender-related cognitive trends.

  • ii.

    The researchers customized the Siamese network to notice differences in women’s speech, boosting the accuracy and sensitivity of dementia detection for female patients.

  • iii.

    A curated, labelled dataset is constructed to train and evaluate the multimodal Siamese networks used in dementia detection, serving as a standard tool for testing and comparing different models.

  • iv.

    Innovative approaches for capturing and merging details from different data types are suggested, giving the Siamese network a better chance to spot signs of dementia accurately.

  • v.

    This study creates performance metrics for detecting female speech-based dementia so the model’s evaluation can be thorough, using sensitivity, specificity, and similar criteria for its target group.

  • vi.

    The model is examined through clinical examples to demonstrate its use in early dementia diagnosis in women and to determine how it might be applied in medical practice.

The remaining sections follow a fixed order. The Section "Literature review" closely examines essential research materials and summarizes them. The Section "Data collection" explores and discusses the datasets used so that readers can properly understand what they include. The methodology section explains the strategies and approaches applied during the study. The Section "Model Evaluation" provides the findings gathered from the analytical stages, and the Section "Discussion" describes these findings in detail. A final section gives an organized summary of the key results. Because the study is orderly and logical, it presents the research, methods, findings, and conclusions meaningfully.

Literature review

Over the past few years, using machine learning and deep learning in speech analysis has greatly improved how dementia is detected1,2. By discussing multimodal Siamese networks, this review combines and analyzes the most recent studies on detecting dementia. Because these networks can efficiently handle numerous data types simultaneously, they can give better diagnostic results3,4. We highlight studies involving women because gender-related factors could influence the accuracy of algorithms for detecting dementia. The women’s speech analysis framework concentrates on the structure of multimodal Siamese networks5,6. The purpose is to help improve how dementia is diagnosed over time.

Effective diagnostic tools are needed due to the rising number of dementia patients. Hee Jeong et al.7 used the Dementia Bank Pitt Corpus, comprising 552 audio recordings from 292 persons. They used deep learning, namely the Bidirectional Encoder Representations from Transformers (BERT) model, to achieve 89.8% accuracy. However, the study’s reliance on specific datasets limits its applicability, and practicality is questionable without real-time validation and external dataset examination; bias and overfitting may also affect model performance8–10. This literature review examines current methods for automatically classifying dementia and emphasizes the need to address these challenges to improve diagnostic accuracy.

In addition, using gated convolutional neural networks (GCNN) to spot dementia through speech paralinguistics may prove valuable in therapy. The authors11 used data from the Pitt Corpus and PROMPT Database (488 and 496 sessions) to study this technique. Because the results were obtained only from these datasets, the precision of 73.1% is a cause for concern, since accuracy is not guaranteed on other data. The team achieved an 80.8% correct rate when the entirety of each patient’s voice recordings was used. Yet the approach’s modest precision and inconsistent results with voice recordings of various lengths raise questions about its usefulness and reliability12,13. Such findings indicate that the model needs further refinement and that applying it to data beyond what was studied is hard.

Besides, new multimodal deep learning techniques are used to tackle the problem of accurate and accessible detection of dementia. The approach of Ilias et al.14 was tested with BERT + Gated Self-Attention, LSTM, and an ensemble technique on 54 AD and 54 non-AD patients, recognizing the value of both speech and text. They achieved an accuracy of 86.25% and an F1-score of 85.4%. The work also raises the problems of transferring results to other contexts, generalizing to a wide range of individuals, and interpreting data drawn from multiple sources. For this reason, these approaches remain hard to adopt broadly, showing that more robust measures are needed to keep such models reliable in every clinical situation15,16.

Also, Alzheimer’s disease is a worldwide health challenge that calls for new methods of early detection. Khan et al.17 used transcriptions of patient speeches to classify cases of Alzheimer’s dementia using a stacked deep-dense neural network, achieving a success rate of 93.31%. The model was supported by 3245 audio recordings from the Dementia Bank Database. However, the study’s findings indicate potential challenges, particularly in confirming the results with external sources and verifying their usefulness in real-life circumstances. Exclusively using transcript data raises concerns about missing contextual variables that may impair prediction accuracy18,19. This model has been successful, but future Alzheimer’s disease prediction systems must carefully analyze its limitations, which emphasizes the necessity of balancing accuracy and data.

Furthermore, improved diagnostic procedures are needed due to the rising incidence of dementia. Yamanki et al.20 used the Pitt corpus dataset of 307 Alzheimer’s patients and 243 controls and investigated using SBERT to extract semantic characteristics for dementia detection. The study analyzed data using deep learning, KNN, RF, SVM, and ANN; SVM had a 77% accuracy rate. Still, there were substantial issues, such as limited dataset diversity, features that might not be retrieved objectively, and reliance on a single database, which restricted the method’s applicability21–23. The methodology’s dependence on its own effectiveness and its lack of real-time application were also highlighted. The study underscores the need for advanced dementia detection tools in clinical settings.

In addition, early dementia detection demands new methods due to its rising prevalence. Ortiz et al.24 developed a deep learning-based multimodal architecture to predict dementia, using the Dementia Bank Database, which includes 194 dementia patients and 99 controls. With the BERT model, they achieved 90.36% accuracy. Due to the single dataset, data collection bias, and lack of external validation, the findings may be limited in applicability; prioritizing explainability may also limit model complexity, and the lack of real-world application experience restricts the model’s practical applicability25,26. Despite these limitations, Ortiz et al.’s research advances deep learning for multimodal dementia prediction. Further research and refinement across numerous datasets are needed to improve dependability and usability.

Studies on detecting dementia using machine learning and deep learning are gaining interest in the healthcare sector. Kumar et al.28 employed 442 audio items from patients with dementia and healthy controls from the Dementia Bank Database. The PRCNN (Parallel Recurrent CNN) model achieved 85% accuracy by combining CNN, ANN, and RNN. However, using only the Pitt corpus limited the study’s generalizability, introduced potential biases, and overlooked general language aspects. Because external validation was not possible, investigations into dementia diagnosis need a vast quantity of data29,30. These limitations show the difficulty of making accurate and fair machine learning and deep learning models for identifying dementia from speech.

The following section, Table 1, includes a summary of the earlier studies, showing details like the datasets they tested, the approaches used, any drawbacks found, and the conclusions.

Table 1.

List of past references, including datasets, methodology employed, limitations, and findings.

Ref. Datasets Methodology Limitations Results
7 • Dementia Bank, Pitt Corpus contains 292 participants with 552 audio recordings. • Deep-Learning, Bidirectional Encoder Representations from Transformers (BERT) model • Restricted applicability to many people because the study depends on particular datasets. The absence of external dataset assessment and real-time validation raises concerns regarding practical applicability. The performance of the model may be impacted by possible bias and overfitting. • This model has an accuracy of 89.8%
11

• Pitt Corpus is a part of the Dementia Bank Dataset.

• There are around 488 selected sessions of the Pitt Corpus.

• Prompt Database: There are around 496 session recordings.

• Gated Convolutional Neural Networks (GCNN). • Limited generalizability because Pitt Corpus and PROMPT Database evaluations were conducted on particular datasets. The modest accuracy and inconsistent results with diverse voice data lengths raise concerns regarding robustness and practical application.

• Precision rate of 73.1%.

• This accuracy rate increases to 80.8% when the entirety of the patient’s speech data is utilized.

14 • The dataset consists of 54 AD and 54 non-AD patients, while the test set includes 24 AD and 24 non-AD patients. • BERT + Gated Self-Attention, LSTM, Ensemble Technique • Limited applicability due to specific dataset characteristics, potential challenges in generalization to diverse populations, and complexities in interpreting the combined multimodal approach impact the broader adoption of the proposed methods. • Accuracy and F1-score of 86.25% and 85.4%.
17 • The dataset consisted of 3245 pairs of audio recordings and the Dementia Bank Database • Stacked Deep Dense Neural Network. • Some potential drawbacks include findings unique to the dataset, difficulties with external validation, and the requirement for testing for real-world applicability. Because the study relies only on transcript data, it may miss more extensive contextual elements that influence the accuracy of Alzheimer’s prediction. • Accuracy of 93.31%
20 • The Pitt corpus dataset consists of 307 people with Alzheimer’s Disease and 243 healthy controls. • Deep-Learning, KNN, RF, SVM, ANN • Limited dataset diversity, potential feature extraction bias, generalizability difficulties due to a single database, reliance on the suggested methodology’s effectiveness, and absence of real-time applicability. • SVM accuracy is around 77%.
24

• Dementia Bank Database.

• It consists of 194 Dementia Patients, with 99 Control Patients.

• BERT Model.

• Deep Learning-Based Multimodal

• Limited generalizability due to a single dataset, possible bias in the data collection process, ambiguous external validation, and a shortage of information regarding practical use. The emphasis may compromise model complexity on explainability. • Accuracy is 90.36%.
28

• Dementia Bank Database.

• 218 audio recordings were considered from dementia participants, and 224 from healthy control (HC) subjects, totaling 442 audio recordings.

• Machine Learning, CNN, ANN, RNN.

• PRCNN (Parallel Recurrent CNN)

• There needs to be more generalizability due to the Pitt corpus being the only dataset used, potential bias in the data representation, lack of external validation, and potential overlooking of holistic linguistic characteristics due to the emphasis on individual speech variables. • Accuracy is 85%.
Our Paper • The study uses the Pitt Corpus, a vast collection of multimodal exchanges from the Dementia Bank database. The dataset offers a varied and well-chosen sample, comprising 104 controls, 208 dementia patients, and 85 people without a diagnosis. • Multimodal Siamese networks • Despite the 99% accuracy with which our multimodal Siamese networks identify dementia from speech, our report shares some of the same issues as previous research: our dataset was skewed, and our findings might not apply in different contexts. Our work is nevertheless distinguished by its accuracy, which highlights the possibility of significant therapeutic implications for early diagnosis and intervention. • The model demonstrates the efficacy of multimodal Siamese networks for dementia detection from speech in women, with an accuracy of 99%. This high accuracy raises the possibility of valuable applications for early diagnosis and intervention in clinical settings.

Data collection

This study uses speech data from the Pitt Corpus section of the Dementia Bank database. Dementia Bank comprises many research and communication modalities chosen for studying dementia7,14,24. The corpus includes the “Cookie Theft” picture-description task, in which participants describe the scene, and all responses are represented in both written and audio form. Because this dataset uses spontaneous speech, evaluating verbal and cognitive-communication skills is simpler. The dataset may be requested through a formal research request process.

Audio recordings and their corresponding transcriptions were gathered for this study. The dataset consists of a heterogeneous mix of participants: 104 control group members, 208 persons with a dementia diagnosis, and 85 individuals without a diagnosis. By including people presenting with dementia as well as those with uncertain diagnostic status, this dataset enables a thorough investigation of speech patterns and cognitive performance in oral communication.

Data description

This study uses a carefully selected dataset from a supervised cookie theft experiment. The dataset consists of transcripts corresponding to audio recordings of participant-doctor exchanges, which are shown in Figs. 2 and 3. The two groups that comprise the dual-modal dataset are “control” and “dementia”, representing a variety of cognitive health states. The audio files are encoded in the popular MP3 format, which provides a standardized and broadly compatible audio representation. Concurrently, the transcripts of the dialogues are formatted using the CHA file type, guaranteeing structural coherence for further examination. The dataset can be accessed through the following link: https://dementia.talkbank.org/access/English/Pitt.html.
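Since the transcripts ship in the CHAT (.cha) format used by TalkBank, a typical first preprocessing step is isolating the participant's speech tiers (`*PAR:`) from the investigator's turns and the `%`-prefixed annotation tiers. A minimal sketch follows; the sample transcript is invented for illustration, and real Pitt Corpus files are longer and richer:

```python
# Minimal sketch of pulling participant utterances out of a CHAT (.cha)
# transcript. The sample below is an invented example.
sample_cha = """\
@Begin
@Participants:\tPAR Participant, INV Investigator
*INV:\tcan you tell me what you see ?
*PAR:\tthe boy is taking a cookie .
%mor:\tdet|the n|boy aux|be&3S part|take-PRESP det|a n|cookie .
*PAR:\tthe stool is tipping over .
@End
"""

def participant_utterances(cha_text):
    """Keep only the *PAR: tiers (participant speech), dropping the
    investigator's turns and the %-prefixed annotation tiers."""
    out = []
    for line in cha_text.splitlines():
        if line.startswith("*PAR:"):
            out.append(line.split(":", 1)[1].strip())
    return out

print(participant_utterances(sample_cha))
# -> ['the boy is taking a cookie .', 'the stool is tipping over .']
```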

Fig. 2.

Fig. 2

Distribution of Audio Dataset Classes.

Fig. 3.

Fig. 3

Distribution of Transcript Dataset Classes.

The dataset includes 397 participants: 104 controls, 208 individuals with dementia, and 85 with unknown diagnosis.

Furthermore, the following is the composition of the dataset:

  • i.

    Audio collection

  • Control: 238 audio files.

  • Dementia: 298 audio files.

  • ii.

    Transcript collection

  • Control: 243 transcript files.

  • Dementia: 306 transcript files.

This distribution offers a fair portrayal of the control and dementia conditions, allowing for a thorough analysis of the experiment’s results. Every audio recording and transcript annotation is distinct, adding to the dataset’s diversity and richness. Incorporating both textual and audio transcripts makes a study with multiple perspectives possible, providing information on both the spoken and written parts of the participant-doctor interactions. The careful selection and thorough information labelling offer a solid basis for the study project’s conclusions and further analytical efforts.

This research includes a cohort of male and female participants aged 46 to 90. The collection consists of detailed data about people who have been diagnosed with dementia as well as a control group made up of cognitively healthy people. The dataset is noteworthy for having a quantitative sample breakdown that accounts for the possibility of multiple recordings per subject. The samples are primarily audio recordings gathered from each participant. Interviews are conducted using the “Cookie Theft” picture from the Boston Diagnostic Aphasia Examination, which graphically portrays a scene with several components.

The participants are interviewed in English, and a picture of the described scene is shown to help them investigate it closely. The words people say in the interviews are recorded, and their statements are transcribed using a standard approach to prepare for analysis. Significantly, the dataset concentrates on women, which makes it easier to explore how gender affects the experience of dementia. This procedure ensures that all data is accurately and systematically gathered and aligns with the goals of the research, which are to study how cognition functions with dementia through verbal statements and visual observations.

We will publicly release our code, covering the different methods used, such as Siamese networks, MFCC, and wavelet features. We have also checked our model on other open-source datasets, achieving high results.

Data visualization

In this study, we examined the distribution of the data and its essential properties by combining data exploration with thorough data collection. The resulting graphs support careful study of the data structure, reveal any class imbalance, and flag problems that could arise in modelling. This step aimed to understand the dataset’s main traits, chiefly how the classes are distributed and whether any problems need to be addressed before further modelling. Because the research aims to be helpful for all relevant stakeholders, the detailed visualizations presented are meant to support sound decision-making: they reveal the dataset’s details and highlight significant changes or irregularities, providing a strong basis for the rest of the analysis and ensuring that all the data is analyzed carefully and precisely.

Analysis of audio waveforms

  • i.

    Data overview

  • Audio Data: The dataset analysis focuses on the waveforms in the audio data, illustrated in Fig. 4, which gives a detailed look at the time and amplitude features found in the audio data from the project’s databases.

  • Comparison of Signal Lengths: The study compares the lengths of all the audio samples from different classes to observe if there are any differences. It is necessary to understand these differences for the study’s objectives, which this comparison analysis achieves.

  • Label-Based Categorization: Labels are assigned to the waveforms, representing their condition, so Control or Dementia may sort them. Using this classification, it is easier to spot patterns in the structure of waveforms from different groups.

Fig. 4.

Fig. 4

Waveform of sample audio file for Dementia Patient.

  • ii.

    Procedure for extracting data

  • Categorization Based on Labels: Classifying audio files by label is essential to extracting the dataset. Labels such as “Dementia” and “Control” define the healthy and unhealthy conditions, and further analysis is then carried out on the labelled data.

  • Study of Amplitude and Timing Properties: Fig. 4 contains a detailed examination of amplitude and timing features that lead the way in extracting information from the signals. Finding essential characteristics from the audio waveforms to properly sort and understand the dataset is important.

  • Examination of Duration Comparison: The extraction method includes an in-depth study of how long signals are in various classes. At this point, we gain information on how the dataset works by finding out whether certain groups speak differently.

  • iii.

    Filtering data to exclude irrelevant or noisy information

  • Precise Labelling: Labelling every audio sample properly ensures correct classification. When the groups are separated, it helps prevent research from being filled with unnecessary information and confusion.

  • Outlier Exclusion: Outlier signals with unusual lengths are discovered and discarded during data filtration. Because noise or other disturbances are removed here, typical waveform patterns are better represented more clearly.

  • Diversity of instances: Having both cases in the set is essential to make comparison possible. Having various recordings in a study improves its reliability and supports detecting different audio waveforms.

  • Validation of Categorization: A complete study should ensure that the data grouping process removes mislabelled or misclassified information to prevent errors in the final result. The outcomes of this process strengthen and ensure the validity of the dataset as it will be used further.

This study examined how the amplitude of the audio changed with time, displaying it graphically on the y-axis. Changes in the audio data are illustrated in Fig. 5, which also shows the loudness of the samples. Several classes were examined to discover which amplitude aspects might help spot changes in people’s speech. This gives an early alert to subtle acoustic details that can later change the outcome of analyses and models.

Fig. 5.

Fig. 5

Waveform of sample audio file for Control participants.

By offering visualizations, users can grasp the main attributes of the dataset more easily. It is easier to make sense of the data when audio waveforms are shown graphically and look neat. The primary responsibility of this step is to find patterns, abnormalities, and differences in acoustic aspects that may affect future analysis. Carefully studying the signal variations helps compile a solid data set that supports detailed examinations and leads to better choices in the following research stages.
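The duration and amplitude comparisons described above can be sketched as follows. This is a toy example with synthetic stand-in waveforms; the sample rate and signals are assumptions, and real recordings would be decoded from the corpus MP3 files (e.g. with librosa):

```python
import numpy as np

sr = 16_000  # assumed sample rate; the corpus audio is MP3, decoded elsewhere

# Two stand-in waveforms in place of real class recordings.
rng = np.random.default_rng(1)
control_wav = 0.3 * np.sin(2 * np.pi * 220 * np.arange(2 * sr) / sr)  # 2 s tone
dementia_wav = 0.1 * rng.standard_normal(3 * sr)                      # 3 s noise

def describe(wav, sr):
    """Duration and peak amplitude -- the two properties compared
    across classes in the waveform analysis above."""
    return {"duration_s": len(wav) / sr, "peak": float(np.abs(wav).max())}

print(describe(control_wav, sr))   # duration 2.0 s
print(describe(dementia_wav, sr))  # duration 3.0 s
```

Aggregating these per-file summaries by label ("Control" vs. "Dementia") yields the signal-length and amplitude comparisons shown in the figures.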

Single bark wave plot

A single audio stream is displayed as it plays in Fig. 6. Each position on the Bark auditory spectrum corresponds to a particular range of frequencies and influences a different aspect of human hearing. By turning sound into these graphs, the main frequency features of the signal become easier to identify. A graph of the frequency response displays how energy is divided among the different Bark bands in the audio signal, giving an initial idea of the audio’s timbre and likely acoustic characteristics.

Fig. 6.

Fig. 6

Single Bark Wave Plot.

The “Single Bark Wave Plot” quickly shows the important frequency characteristics of an audio source. This method makes it easier to spot trends, errors, and specific frequencies that may matter for later feature extraction. This way of displaying data is handy in the early stages of inquiry, since it quickly allows experts to examine the spectrum of the sound.
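For reference, the Hz-to-Bark mapping underlying such plots can be approximated in closed form. The paper does not state which Bark formula its plotting code uses, so the Zwicker-style approximation below is an assumption:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker-style closed-form approximation of the Bark scale
    (an assumption; other Bark formulas exist and differ slightly)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

# The mapping compresses high frequencies, mirroring human hearing.
for f in (100, 1000, 4000):
    print(f, "Hz ->", round(float(hz_to_bark(f)), 2), "Bark")
```

Note that 1000 Hz lands near 8.5 Bark, and the compression of high frequencies is why Bark-band plots emphasize the perceptually important low-frequency range.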

Multi-Bark sound spectrogram

Figure 7, referred to as the “Multi-Bark Sound Spectrogram”, clearly shows the time-based description of a signal’s overall spectral qualities. The plotting function uses Librosa and matplotlib to illustrate which frequencies are present in the sequence, allowing a clear view of how the frequency components of the audio shift over time. Using the short-time Fourier transform (STFT), the audio stream is separated into frequencies at short intervals, demonstrating the changes in the audio over time and bandwidth. The STFT magnitudes are then converted to decibel values (dB) by applying a logarithm; because of this, the less intense sound parts become clearer, with more noticeable spectra. The colour spectrum represents how much the power of each frequency component in the signal changes over time: bright hues correspond to high amplitudes and dark tones to low amplitudes. The spectrogram helps researchers spot transient noises, steady sounds, and harmonic structures, as it displays the frequency variation during a recording.

Fig. 7.

Fig. 7

Multi-Bark Sound Spectrogram.

Frequency analysis reveals audio details such as pitch variations, harmonics, and how the sound’s spectrum changes over time. It is essential for analyzing detailed sound signals, identifying important events, and separating notable elements for further examination. Applications of this visualization include sound detection, audio processing, and music analysis.
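The STFT-then-decibels pipeline described above (implemented in the paper with Librosa and matplotlib) can be sketched with plain numpy; the frame length, hop size, and Hann window below are assumptions:

```python
import numpy as np

def stft_db(y, n_fft=512, hop=256):
    """Magnitude spectrogram in decibels: frame -> Hann window -> rFFT
    -> 20*log10(|X|), clipped to avoid log(0)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))          # (frames, n_fft//2 + 1)
    return 20.0 * np.log10(np.maximum(mag, 1e-10)).T   # (freq, time) in dB

sr = 8000
t = np.arange(sr) / sr
spec = stft_db(np.sin(2 * np.pi * 440 * t))  # one second of a 440 Hz tone
print(spec.shape)  # (257, 30): 257 frequency bins, 30 frames
# The brightest row sits near bin 440 / (sr / n_fft) ≈ 28.
```

Plotting this matrix with a colour map (e.g. matplotlib's `imshow`) reproduces the spectrogram view, with bright regions marking high-energy frequency components.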

Mel-frequency cepstral coefficients (MFCC)

The research also investigates Mel-frequency cepstral coefficients (MFCC) for different purposes in audio analysis. MFCC preserves what makes an audio source distinctive by extracting the short-term power spectrum of the sound, which facilitates studying its unique audio patterns. MFCC coefficients are shown visually with background colours: Fig. 8 explains this diagram with the help of a colour bar, which links the brightness of a colour to the magnitude of the MFCC coefficients.

Fig. 8.

Fig. 8

MFCC Coefficients Visualization for Sample Audio.

Regarding data preparation, the study also examines the relationships between annotations and the utterances they describe, which requires careful checking to achieve exactness and relevance. Certain conditions are applied when extracting data, namely context relevance, semantic coherence, and syntactic congruence, to ensure only the needed data remains and to make the dataset sturdier. The objective is to eliminate unnecessary information and improve the accuracy and relevance of the retained annotations, making the research findings more dependable.

The MFCC (Mel-Frequency Cepstral Coefficients) method is widely used for extracting features from audio signals. It compactly summarizes the frequency content of short audio frames, recording key information from an audio signal's spectrum and how that spectrum changes across frames.

MFCC coefficients describe the characteristic makeup and variations of audio spectra; the way the coefficients evolve reveals the distinctive traits of the audio. The number of coefficients sets the dimensionality of the feature vector, typically between 13 and 40, and each coefficient measures one aspect of the audio spectrum.

Many uses of audio analysis show the flexibility of MFCC features. Some possible uses are showcasing speech recognition, grouping music by genre, and finding sound events. These traits are used to record and examine important details of an audio signal.

With a visualization tool, researchers can decide which features to use and how to train their models. Analysts can use such a tool to inspect how the MFCC coefficients change over the course of the sound signal. Plotting the MFCC coefficients allows researchers to notice traits and trends in audio data, which aids categorization and analysis in many settings.

Mel-frequency cepstral coefficients thus describe the spectral content of an audio source. Visual inspection is important in many situations, and here the focus is on how the different frequency components contribute to the sound stream. Surface plots in audio signal processing allow quick extraction of the main features and prepare the ground for the rest of the analysis.
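To make the MFCC pipeline concrete, the following is a minimal numpy-only sketch of the standard computation (framing, windowed power spectrum, mel filterbank, log, DCT-II). It is an illustration of the general technique, not the authors' implementation; the frame length, hop size, and filter counts are hypothetical defaults.

```python
import numpy as np

def hz_to_mel(f):
    # Convert frequency in Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    # Frame the signal, window it, and take the power spectrum per frame.
    frames = [signal[i:i + n_fft] for i in range(0, len(signal) - n_fft + 1, hop)]
    window = np.hanning(n_fft)
    power = np.array([np.abs(np.fft.rfft(f * window)) ** 2 for f in frames])
    # Apply the mel filterbank and take logs.
    mel_energy = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_coeffs.
    n = n_filters
    dct = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_coeffs)[:, None])
    return mel_energy @ dct.T  # shape: (n_frames, n_coeffs)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
feats = mfcc(sig)  # one 13-dimensional coefficient vector per frame
```

In practice a library routine such as librosa's MFCC extraction would be used; the sketch only shows what such a routine computes.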

Zero crossing rate

Figure 9 shows the Zero Crossing Rate, in which audio streams are represented by lines indicating where zero crossings occur. A high zero-crossing rate signals sharp changes in the recorded signal: many polarity shifts happen in quick succession within such regions, which typically correspond to segments with noticeable amplitude changes, noise, or percussive events. Where the waveform crosses the zero line only a few times, the sound is more stationary; such regions usually contain sustained tones or noise segments whose level changes little. The zero-crossing rate therefore characterizes the texture of the audio stream: lower rates usually indicate smooth, tonal content, whereas higher rates indicate choppier content with more sudden changes.

Fig. 9.

Fig. 9

Zero crossing rate Waveform.

The zero-crossing rate is also helpful in applications such as pitch estimation, instrument identification, and sound recognition. With this visualization, one can distinguish parts of the audio that change over time, which advances understanding of the temporal aspects of the signal. The zero-crossing rate marks where an audio signal switches from positive to negative polarity; these polarity changes reveal the signal's fine-grained temporal patterns. Researchers can examine and organise audio signals using the zero crossings on the graph as markers of sudden changes in amplitude.
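The zero-crossing rate is simple to compute directly. The sketch below counts sign changes per frame in pure numpy (frame and hop sizes are hypothetical); a low-frequency tone yields a much lower rate than a high-frequency one, matching the intuition described above.

```python
import numpy as np

def zero_crossing_rate(signal, frame_len=1024, hop=512):
    """Fraction of adjacent-sample sign changes within each frame."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Count transitions between positive and negative samples.
        crossings = np.sum(np.abs(np.diff(np.signbit(frame).astype(int))))
        rates.append(crossings / (frame_len - 1))
    return np.array(rates)

# A 100 Hz sine crosses zero far less often than a 2 kHz sine at the same rate.
sr = 8000
t = np.arange(sr) / sr
low = zero_crossing_rate(np.sin(2 * np.pi * 100 * t))
high = zero_crossing_rate(np.sin(2 * np.pi * 2000 * t))
```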

Root mean squared energy & log power spectrogram

The “Root mean squared energy and log power spectrogram” chart combines two helpful audio analysis tools: the root mean squared energy (RMSE) and the log power spectrogram. It shows how the frequencies and energy in the audio are distributed, with the energy of the audio reported for every time slice. The shape of the RMSE array defines the number of time frames, and the graphic displays the energy changes across those frames.

The RMSE curve in Fig. 10 shows how the signal's energy changes over time. Peaks in the curve usually mark intense musical moments, forceful sounds, or strong emotional sections. The log power spectrogram shows the energy in the frequency bands as a function of both time and frequency: bright regions indicate an increase in energy, and dark areas a decrease. RMSE energy and the logarithmic power spectrogram are complementary; matching peaks in the RMSE curve with features in the logarithmic power spectrogram reveals the connection between energy peaks and distinctive spectral events.
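Both quantities in Fig. 10 can be sketched in a few lines of numpy. The example below is a generic illustration (frame sizes are hypothetical, not taken from the paper): per-frame RMS energy rises in louder passages, and the log power spectrogram is the log of the squared magnitude spectrum per windowed frame.

```python
import numpy as np

def rms_energy(signal, frame_len=1024, hop=512):
    """Root mean squared energy per frame."""
    return np.array([
        np.sqrt(np.mean(signal[i:i + frame_len] ** 2))
        for i in range(0, len(signal) - frame_len + 1, hop)
    ])

def log_power_spectrogram(signal, n_fft=512, hop=256):
    """Log of the squared magnitude spectrum per Hann-windowed frame."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10)

# Quiet first half, loud second half: the RMSE curve rises accordingly.
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(4096), rng.standard_normal(4096)])
energy = rms_energy(sig)
spec = log_power_spectrogram(sig)
```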

Fig. 10.

Fig. 10

RMSE Curve, Waveform, and Spectrogram.

Word cloud of pre-processed transcripts

The Word Cloud in Fig. 11 gives a visual representation of the pre-processed transcripts from the dataset. The size of each term in the word cloud indicates its frequency of occurrence in the transcripts. The transcripts exhibit a higher frequency of complex vocabulary and a lower frequency of simpler phrases.

Fig. 11.

Fig. 11

Word Cloud Visualization of Transcripts.

The Word Cloud facilitates visual identification of frequently used terms, providing insight into the language patterns in the transcripts. The graphical representation shows that many of the displayed words lack coherence and meaning, indicating impaired linguistic expression; this observation suggests that the patients under study had difficulty articulating speech. The term of interest in this context is “cookie”: numerous patient conversations centred on a potentially ambiguous cookie narrative that required textual and auditory clarification.
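A word cloud is driven by term frequencies: the more often a word occurs, the larger it is drawn. The sketch below shows that underlying counting step on a few invented transcript fragments (the sentences and stop-word list are illustrative, not from the dataset); a rendering library such as the `wordcloud` package would then size words by these counts.

```python
import re
from collections import Counter

# Hypothetical pre-processed transcript fragments from a picture-description task.
transcripts = [
    "the boy is taking a cookie from the cookie jar",
    "mother is washing dishes and the water is overflowing",
    "the girl is asking for a cookie",
]

STOP_WORDS = {"the", "is", "a", "and", "for", "from"}

def term_frequencies(docs):
    # Lowercase, keep alphabetic tokens, drop stop words, and count.
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

freqs = term_frequencies(transcripts)
# "cookie" has the highest count, so it would be rendered largest.
```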

EDA analysis

RMSE is a standard measurement for speech and music used to assess energy and identify quiet and loud regions in sound, while a log spectrogram can reveal harmonic patterns, sudden events, and variations in the spectrum. The combined Root Mean Squared Energy & Log Power Spectrogram view is used to analyze the energy levels and spectral information in an audio stream. A detailed analysis of audio data involves plotting the RMSE curve and logarithmic power spectrogram side by side, which allows one to study how changes in energy and spectrum are related.

In the study “Multimodal Siamese Networks for Dementia Detection from Speech in Women,” EDA visualizations present the details and characteristics of the audio data involved. The figures offer fresh perspectives on the audio data: changes in the waveform plots indicate possible variations in volume, speaking patterns, and background noise, while the spectral plots highlight how frequencies are grouped and show the transitions between ranges. The Multi-Bark Sound Spectrogram captures the frequency shifts that occur over time. The zero-crossing rate and spectral centroid metrics reflect signal characteristics and pitch fluctuations, while the MFCCs reveal spectral features. Finally, examining the combined root mean squared energy and logarithmic power spectrogram reveals distinct spectral and energy patterns. Taken together, these visualizations provide insight into essential properties of the dataset, building a foundation for well-supported decisions on feature extraction and modelling in research on dementia diagnosis.

Audio preprocessing

The comprehensive steps involved in data preprocessing for the audio-based dataset (As shown in Fig. 12) are as follows:

Fig. 12.

Fig. 12

Audio-Based Data Pre-processing Steps.

  • Audio File Loading: The process commenced with the librosa.load function to load individual audio files. This phase made the raw audio signals available and prepared them for feature extraction27.

  • MFCC Coefficients: Each audio file was analyzed and its MFCCs were calculated. The coefficients encompass the spectral and temporal aspects of the audio, fundamental acoustic characteristics that are significant for the subsequent analysis27.

  • Consistent Time Steps: A fixed number of time steps was set so that all MFCC feature matrices in the dataset would be uniform. Matrices with too many frames were truncated, and matrices with too few were zero-padded to the standard length27.

  • Data Augmentation: This step applies carefully controlled changes to the initial MFCC features. The MFCC matrices were duplicated and modified by adding controlled amounts of random noise.

  • Combining Augmented Data: By combining the augmented MFCC features with the original features, a better dataset was formed. This integration improved the model’s ability to handle issues found in real audio and generalize its knowledge.

  • Dataset Splitting: The enriched dataset was divided into separate sets for training and testing using the train_test_split function. 20% of the data was used to build the testing set, and the remaining 80% was used to train the Multimodal Siamese Network model.

Collectively, these comprehensive procedures facilitate the preparation of the audio-centric dataset for subsequent modelling. One potential approach for identifying dementia through speech analysis involves training a model using processed data that has been normalized, enriched, and diversified.
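The padding/truncation, noise augmentation, and splitting steps above can be sketched in numpy. This is a minimal illustration under stated assumptions (100 time steps, 13 coefficients, 20% test fraction, seed 42 are hypothetical stand-ins), and the hand-rolled split simply mimics what a call to scikit-learn's train_test_split would do.

```python
import numpy as np

def fix_time_steps(mfcc, target_steps=100):
    """Truncate or zero-pad an (n_frames, n_coeffs) MFCC matrix to a fixed length."""
    if mfcc.shape[0] >= target_steps:
        return mfcc[:target_steps]
    pad = np.zeros((target_steps - mfcc.shape[0], mfcc.shape[1]))
    return np.vstack([mfcc, pad])

def augment_with_noise(features, noise_std=0.01, seed=42):
    """Duplicate feature matrices with a small amount of Gaussian noise added."""
    rng = np.random.default_rng(seed)
    return features + rng.normal(0.0, noise_std, size=features.shape)

def train_test_split(X, y, test_frac=0.2, seed=42):
    """Shuffled split; the fixed seed keeps it reproducible."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

# Example: five variable-length MFCC matrices with 13 coefficients each.
rng = np.random.default_rng(0)
raw = [rng.standard_normal((n, 13)) for n in (60, 120, 100, 80, 140)]
X = np.stack([fix_time_steps(m) for m in raw])           # uniform (5, 100, 13)
X_full = np.concatenate([X, augment_with_noise(X)])      # originals + noisy copies
y_full = np.tile(np.array([0, 1, 0, 1, 0]), 2)
X_tr, X_te, y_tr, y_te = train_test_split(X_full, y_full)
```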

Transcript-based dataset preprocessing

The following section outlines the comprehensive procedures involved in preprocessing data for the dataset, explicitly focusing on transcripts, as shown in Fig. 13.

Fig. 13.

Fig. 13

Transcript-Based Data Preprocessing.

  • Stop Words and Punctuation Removal: Common stop words and punctuation were removed from the text to produce clearer and more meaningful transcripts.

  • Dataset Setup: The dataset’s organizational structure was established by specifying a base directory with subfolders for the “Dementia” and “Control” categories. Consequently, the compilation of subfolders is formed.

  • Lists for Transcripts and Labels: Separate lists were created to hold the transcripts and their labels. The transcript text is converted to lowercase, and non-letter characters, excess white space, and punctuation are removed. To improve the clarity of the text representation, tokenization, stop-word removal, and lemmatization were performed.

  • Text preprocessing: Detailed steps include replacing words with their synonyms as part of text pre-processing. The synonym_replacement routine used a random process to decide which terms to replace, making the dataset richer and more diverse, and the material generated from every transcript enlarges the dataset.

  • Data augmentation: Enriching training data with synonym substitution is a well-known technique. The tool randomly replaces in-domain words, increasing the dataset's richness and diversity, and further variants of each transcript were generated to make the dataset bigger.

  • Preprocessing of transcripts: Pre-processing began by reading the CHA files that store the original contents. Text pre-processing then prepares the content for further analysis, and the synonym-substitution step enriches the language of the processed content. The augmented transcripts are appended to the existing list, and each transcript is assigned its category according to its subfolder.

  • Label Encoding: Label Encoding converts category labels into numerical values to train the model.

  • Tokenization and Padding: The texts underwent tokenization and padding processes to ensure compatibility with the model and maintain consistent sequence lengths. A tokenizer converts the complete set of transcripts into numerical sequences.

  • Dataset Splitting: After pre-processing and augmentation, the transcript dataset is split into training and testing subsets using the train_test_split tool, allocating 90% for training and 10% for testing.

The aforementioned comprehensive techniques are employed to pre-process the dataset based on transcripts to prepare it for training and evaluation in the Multimodal Siamese Network model. The processed data consists of normalized and enriched transcripts and also includes labels indicating the presence of dementia.
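The cleaning, tokenization, padding, and label-encoding steps above can be sketched as follows. This is an illustrative stand-in, not the paper's code: the stop-word list is abbreviated, the 10,000-word vocabulary cap and 500-token padded length are taken from the hyperparameters reported later, and synonym replacement and lemmatization are omitted for brevity.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "to", "of"}  # abbreviated for illustration

def clean_text(text):
    # Lowercase and strip non-letter characters, then drop stop words.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]

def build_vocab(docs, max_words=10000):
    # Most frequent words get the lowest indices; 0 is reserved for padding.
    counts = Counter(t for doc in docs for t in doc)
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_words))}

def encode_and_pad(doc, vocab, max_len=500):
    # Map tokens to integers, truncate, and zero-pad to a fixed length.
    seq = [vocab[t] for t in doc if t in vocab][:max_len]
    return seq + [0] * (max_len - len(seq))

docs = [clean_text(t) for t in
        ["The boy is taking a cookie.", "Mother is washing dishes."]]
vocab = build_vocab(docs)
seqs = [encode_and_pad(d, vocab) for d in docs]

# Label encoding: category names become integers for training.
labels = ["Dementia", "Control", "Dementia"]
label_map = {name: i for i, name in enumerate(sorted(set(labels)))}
encoded_labels = [label_map[l] for l in labels]
```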

Multimodal Siamese networks

Multimodal Siamese networks have substantially advanced deep learning, especially for examining and understanding heterogeneous data. The base Siamese network is extended to accept, in addition to image data, text, audio, graphics, and other kinds of data, and combining these modalities improves performance. The architecture handles and unites information from multiple sources, extending the basic Siamese idea: such networks process input data effectively and jointly, so operations improve and users can obtain valuable insights from diverse inputs. Introducing these techniques into the Siamese network allows information to be processed more thoroughly, benefiting its use across multiple industries.

The following expression shows the mathematical Model of Multimodal:

graphic file with name d33e1337.gif 1
  • ‘m’ is the number of combined modalities.

  • ‘L’ is the number of classes.

  • ‘K’ is the number of Multi-Modal data point pairs.

In this case, the modalities are handled by two Siamese network branches for audio and text data classification.

Feature extraction from text using a dense layer is expressed as:

graphic file with name d33e1360.gif 2
  • ‘H’ is the number of hidden units in the output dense layer.

Feature extraction from audio using a dense layer is expressed as:

graphic file with name d33e1375.gif 3
  • ‘H’ is the number of hidden units in the output dense layer.

A fused representation is produced by concatenating the outputs of the text and audio models:

graphic file with name d33e1390.gif 4

Pseudo-code for the multimodal Siamese network
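As the pseudo-code figure is not reproduced here, the forward pass implied by Eqs. (2)–(4) can be sketched numerically. This numpy illustration uses randomly initialized weights and the branch widths reported in the paper (64-dimensional text features, a 128-to-64-unit audio branch, a 128-unit fusion layer, and a sigmoid output); the 64-dimensional text input stands in for the LSTM output, which is not implemented here, and the flattened audio input size is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def dense(x, w, b, relu=True):
    # A fully connected layer with optional ReLU activation.
    out = x @ w + b
    return np.maximum(out, 0.0) if relu else out

def init(n_in, n_out):
    # Small random weights and zero biases (illustrative initialization).
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

W_a1, b_a1 = init(1300, 128)     # audio branch on flattened MFCCs (hypothetical 100x13)
W_a2, b_a2 = init(128, 64)
W_t, b_t = init(64, 64)          # text branch on top of the 64-dim LSTM output
W_f, b_f = init(64 + 64, 128)    # fusion layer over the concatenated features
W_o, b_o = init(128, 1)          # binary classification head

def forward(text_feat, audio_feat):
    t = dense(text_feat, W_t, b_t)                        # Eq. (2): text features
    a = dense(dense(audio_feat, W_a1, b_a1), W_a2, b_a2)  # Eq. (3): audio features
    fused = np.concatenate([t, a], axis=-1)               # Eq. (4): fusion
    h = dense(fused, W_f, b_f)
    logit = dense(h, W_o, b_o, relu=False)
    return 1.0 / (1.0 + np.exp(-logit))                   # sigmoid probability

p = forward(rng.standard_normal((4, 64)), rng.standard_normal((4, 1300)))
```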

Multimodal Siamese networks make it straightforward to understand how different sorts of data connect and interact. As Fig. 14 shows, the Siamese network is explicitly designed to address the unique problems each type of data poses. This design ensures the network can combine the data and form a joint representation, the aim being to extract valuable attributes from a wide range of inputs in order to handle multiple modalities.

Fig. 14.

Fig. 14

Multimodal Siamese Network Basic Structure.

In addition, its capacity to process information from different sources helps the network see how different parts of the problem relate. Training the text model branch with suitable LSTM parameters, such as the number of hidden units, embedding dimensions, and maximum number of words, is significant, while the hidden units of the audio network layer allow the model to extract essential details from the audio input data.

Unlike the previous layers, the output layer merges the Audio Dense and Text model branch results into a new representation that retains each input's richness and exposes hidden information and connections. This fusion shows how examining inputs together overcomes the limitations of treating each modality in isolation. The Siamese structure helps reveal how the data is connected in several ways at once, unlike a single-feature approach.

Figure 15 illustrates the detailed structure of the subnetworks in the Multimodal Siamese Network (MMSN). Its flexibility means it can be applied in different domains, for example cross-modal information retrieval, sentiment analysis, audio-visual scene interpretation, and medical care, in both research and real-life settings. Multimodal Siamese networks are helping artificial intelligence achieve significant results and understanding in several domains: networks of this kind overcome the weakness of single-modal techniques by gathering data from various streams simultaneously. Using several data sources simplifies understanding and boosts the strength of data-driven innovations. In essence, multimodal Siamese networks open up new possibilities for flexible AI systems that meet the complicated demands of many information areas.

Fig. 15.

Fig. 15

Detailed Architecture of Subnetworks for MMSN.

Methodology

It is proposed that feeding multiple inputs into a Siamese network improves the ability to detect dementia in speech patterns from women. This design addresses the related challenges well because the data is handled precisely, the models are built carefully, and the system is thoroughly evaluated24,33.

Ensuring data integrity, and especially preventing data leakage, is a key element of the methodology. We ensured that the training, validation, and test sets had no common or replicated data, and confirmed that only training data, and no test data, was used while fitting the model. The data was split using a fixed random seed to support reproducibility, which is important in scientific work.

Extracting audio features is an important step when preparing the data for the task. Through merging, all samples form one complete dataset, which gives the Siamese network information from multiple sources to learn from. The Siamese network was structured so that all branches saw corresponding information from the various inputs, which helped it detect more subtle signs of dementia.

Consistency among the texts was maintained by applying tokenization and padding to all data; the vocabulary included 10,000 common words and 500 topic words. This choice reflects a trade-off between model capacity and computational efficiency.

All sound samples were evenly divided between the training and testing data, which avoided bias toward specific segments. Multimodal data (audio and text) was integrated into the Siamese structure using the Keras functional API. The text model's layers are also carefully planned: with a vocabulary size of 10,000 and an embedding dimension of 100, the embedding layer captures relationships in the text. A 64-unit LSTM is included to detect how different parts of the sequence depend on each other, which is vital for voice context, and the ReLU activation function is used because it handles non-linearities well. A densely connected layer with 64 units then performs feature extraction.

Similarly, the audio model’s architecture emphasizes picking out essential features from the audio. The flattened audio values are passed through dense layers with 128 and 64 units, which benefit from batch normalization and dropout layers to ensure stability. Because of these decisions, the network can better spot tiny audio changes that could indicate dementia.

Combining multiple modalities is also a key strength of the proposed method. A 128-unit dense layer helps the model merge features and reduce dimensionality so that information from both text and audio inputs is used. Batch normalization and dropout make the model resistant to perturbations and overfitting, and the fusion of information from multiple modalities makes the network more effective at making diagnoses24.

In addition, the classification layer is carefully set up to distinguish between patients who do and do not have dementia. A final dense layer with ReLU, together with additional batch normalization and dropout layers, regularizes the model. Since the classification problem is binary, the output layer consists of a single neuron with a sigmoid activation.

The model is built and trained with the Adam optimizer, a learning rate of 0.001, and binary cross-entropy loss. This optimizer and loss function are appropriate for binary classification tasks, making the model converge quickly and providing precise gradient information. The model is trained for 50 epochs, using 20% of the data for validation, to balance robustness with generalization to new data.
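The binary cross-entropy objective referred to above is simple to state numerically. The sketch below is a generic illustration (the predictions are invented): confident correct predictions yield a small loss, while an uninformative predictor that always outputs 0.5 yields a loss of ln 2 ≈ 0.693.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with clipping for stability."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

y = np.array([1.0, 0.0, 1.0, 0.0])
good = binary_cross_entropy(y, np.array([0.9, 0.1, 0.8, 0.2]))  # confident, correct
bad = binary_cross_entropy(y, np.array([0.5, 0.5, 0.5, 0.5]))   # uninformative
```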

Many different measures are used in the evaluation step. Calculating the loss and accuracy of the trained multimodal model shows how well it performs in general, while examining precision, recall, and F1-score provides more detailed information about the model's strengths and weaknesses. A confusion matrix visualizes how the predictions turn out, identifying which areas require improvement.

ROC curve analysis shows how well the model can be expected to tell different groups apart. Analyzing how well the model separates individuals, and checking its value for use in clinical situations, is very important.
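The confusion-matrix-based metrics mentioned above follow directly from the four outcome counts. The sketch below computes them in plain numpy on invented labels; in practice library routines (e.g. from scikit-learn) would be used.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 matrix: rows = actual class (0, 1), columns = predicted class (0, 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall_f1(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tp, fp, fn = cm[1, 1], cm[0, 1], cm[1, 0]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels: 1 = dementia, 0 = control.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
```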

All in all, this use of Siamese networks for identifying dementia stands out because it involves thorough data preparation, a purpose-designed model structure, and a wide range of assessment techniques2,24. The model relies on a combination of spoken and written representations to build a detailed account of dementia-related speech in women. Choosing appropriate architectural components and standards allows accurate and robust assessment of the model, making it a strong candidate for multimodal diagnostics of neurodegenerative diseases.

Figure 16 displays the main design of the MMSN (Multimodal Siamese Network) model, showing each step needed to form the model. Key phases, such as preparing the data, training the model, and testing it with a range of metrics and plots, detail how well the model works; together these assessment methods reveal the model's performance and its ability to distinguish categories. The figure describes a systematic strategy for combining textual and audio data in one multimodal Siamese network and sets out a clear implementation guide. Drawing on inputs from several sources gives the network better classification ability for binary tasks.

Fig. 16.

Fig. 16

MMSN Model Design Methodology.

Model architecture design

Multimodal Siamese Network for Dementia Detection from Speech in Women (MMSN) is designed to identify dementia using information from women’s voices. It presents a leading way to handle this significant task. The system works with sounds and texts simultaneously to fully understand the many patterns linked to dementia. Every detail in the architecture has been planned, using the best techniques from natural language processing and processing audio signals to enhance the model’s forecast skills, as seen in Fig. 17.

Fig. 17.

Fig. 17

Architecture of the proposed MMSN Model.

  • i.

    Text model

An input layer receives the prepared text sequences as the first step in the MMSN's text model. The text sequences are then passed to an embedding layer that turns them into compact, dense vectors; the embedding helps the model capture words' distinct meanings in their sentences. A Long Short-Term Memory (LSTM) layer captures sequential connections and complex patterns: because LSTMs can model long-term relationships in data, they perform well on sequential inputs. Further layers then examine the LSTM output more closely to uncover any additional hidden details in the text.

  • ii.

    Acoustic model

The audio model uses low-level audio features to capture complex acoustic structure. A densely connected layer with 128 ReLU units identifies critical auditory features; batch normalization keeps its output stable, and dropout is introduced to prevent overfitting. The next layer, with 64 ReLU units, picks up further sound and speech characteristics. Using dropout and batch normalization here improves generalization and makes the audio model more robust.

  • iii.

    Multimodal fusion

Multimodal fusion unites the information from the different modalities into one representation: it combines what the text and audio models produce to represent the data completely. As a result, the model understands the combined effects and obtains a thorough picture of the underlying patterns associated with dementia. When different data sources are used, the model's predictions become stronger because more factors are considered.

  • iv.

    Dimension reduction and feature extraction

When multimodal fusion is finished, the MMSN applies dimension reduction and feature extraction to make recognizing and working with the merged information easier. Layers with many connections, batch normalization, and dropout are used to meet this objective. They serve to find and capture the most essential details and simplify the input before the data is categorized.

  • v.

    The output layer

The output layer is a fully connected layer with a sigmoid activation function. A sigmoid function suits binary classification cases such as determining, as a probability, whether a person has dementia. At this final stage, the output layer blends the textual and audio information to produce the prediction.

  • vi.

    Distinctiveness and precision

Its unique strength is the combination of text and audio for assessing dementia accurately. The model can efficiently grasp the main patterns of dementia by using language and sound together: the LSTM analyzes how words in text are linked sequentially, while batch normalization and dropout in the audio model increase its performance. Using varied information, including voice data, makes the model better at differentiating between women with and without dementia. The MMSN thus combines the advantages of both modalities to detect dementia with particular strength.

Model compilation and training

The approach recommends Multimodal Siamese Networks, which use both speech and text information to provide detailed detection of dementia in women. The model also uses a Siamese network, which makes it possible to blend text with audio in the information processing. Combining these different data improves the accuracy of identifying a specific condition.

  • i.

    Architecture of siamese network

  • The Siamese network serves as the foundation, using two branches to deal separately with text and audio data.

  • Processing of Text Data.

  • The Embedding Layer uses a 10,000-word vocabulary, encoding each word as a 100-dimensional vector.

  • The LSTM layer utilizes 64 LSTM units to capture complex temporal relationships in textual input effectively.

  • Dense Layer: A solitary dense layer comprising 64 units to extract features.

  • Manipulation and Analysis of Audio Signals.

  • Two dense layers with 128 and 64 units are used to improve feature extraction.

  • Integration Layer.

  • Combine the information obtained from all modalities to enhance how the model notices subtle patterns.

  • ii.

    Methods for regularization

  • Dropout: To avoid overfitting, a rate of 0.1 is set for the dropout applied after the LSTM and embedding layers, helping the model adjust and cope with perturbations during training.

  • Using batch normalization, each layer’s inputs are normalized during learning to promote better convergence and generalization of the neural network.

  • iii.

    Optimizing hyperparameters

  • By applying a learning rate of 0.01, the Adam optimizer ensures that the optimization process does not diverge but keeps improving until it reaches the best result.

  • Batch Size: A batch size of 32 allows training to proceed well without exhausting memory, given the limited dataset available for training.

  • Epochs: Training the model for 50 epochs avoids both underfitting and overfitting and allows the training process to converge well, so the model generalizes.

  • Dropout Rate: Bringing the dropout rate to 0.1 keeps the model steady by avoiding underfitting and overfitting.

  • An embedding with a 10,000-word vocabulary and 100-dimensional vectors helps the model capture detailed word relationships without slowing down processing.

  • iv.

    Hyperparameters

  • For our study “Multimodal Siamese Networks for Dementia Detection from Speech in Women,” we decided on and tuned important hyperparameters that helped the model perform better and be more reliable. Before training starts, hyperparameters are selected to help make learning more effective.

  • An 80% training set, 10% validation set, and 10% test set were used for the data. Also, we carried out 5-fold cross-validation and fixed the random seed at 42 for repeatability. Figure 18 shows the graph for dataset splitting.

Fig. 18.

Fig. 18

Graph for dataset splitting.

  • A major hyperparameter is the learning rate, which determines how large a step the optimizer takes at each point in training and whether the model converges slowly or quickly and stably. Among the values tested, the optimal choice for Adam was 0.01, a good trade-off between speed of convergence and stability of the result. A learning rate higher than needed can produce inaccurate results, whereas a lower one can lengthen training and leave the model stuck in a local minimum.

  • Another major hyperparameter is batch size, which specifies the amount of data used for each gradient update. The small size of the dataset made us choose a batch size of 32. Running training on a large batch size could make learning slower because the model gets updated less frequently. Using small batch sizes, models update their weights far more often, which can make predictions more general, but it slows down the training procedure.

  • We chose these hyperparameters (listed in our hyperparameter table) to make the model work well with a small amount of data and still be helpful in different cases.

  • Our model completed 50 epochs, each of which is one pass through the entire training dataset. Too few epochs leave the model underfitted, while too many can overfit it to the training data, so the best number depends on the dataset and the task's difficulty.

  • To avoid overfitting, we used dropout after both the LSTM and embedding layers, removing 10% of activations (dropout rate = 0.1). Too high a dropout rate, however, risks underfitting the data.

  • Embedding units convert words in a text into dense vectors. The input included up to 10,000 unique words, and the output was 100-dimensional embeddings. While larger embedding dimensions can capture more complex relationships, they also demand more computation and can lead to overfitting.

  • The use of 64 LSTM units in the text model was enough to find the patterns in the information over time without facing overfitting. The number of units is based on how long and complicated the input must be.

Furthermore, Fig. 19 highlights the structure of the Siamese network we used to train the model using the settings discussed above.

Fig. 19.

Fig. 19

Proposed Siamese Network Architecture.

  • Dense Units and Layers: The model's ability to learn complex relationships in fused multimodal data relies on the configuration of dense layers. The text model uses one dense layer with 64 units, the audio model uses two layers with 128 and 64 units, and the final classifier uses three layers with 128, 64, and 1 unit for binary classification. More dense layers or units can improve learning but may lead to overfitting if overdone. Hyperparameter tuning, involving testing various value combinations, is crucial for optimizing performance, generalization, and convergence, mainly to ensure the Multimodal Siamese Network effectively detects female dementia from speech data.

  • v.

    Uniqueness and Accuracy of the Model

  • Training Proficiency: During training, the model learned to detect detailed patterns in both data types; over 50 epochs, training accuracy reached 98.46%.

  • Validation Robustness: The model is very robust, achieving an error rate of zero on the unseen validation data. The high validation accuracy indicates that the model can deal with data from many different sources.

  • Performance Visualization: Training and validation accuracy/loss over the epochs are shown in Fig. 18. The steady rise in training and validation accuracy and the fall in loss support the model’s ability to generalize and make accurate predictions.

  • Test Performance: The model shows strong robustness, with a test loss of 0.0002066. Its 100% accuracy in testing shows that it separates dementia cases from control participants entirely on new data.

  • Multimodal Synergy: By using both text and audio, the model exploits the input’s multimodal features. Its high test accuracy demonstrates that it combines information from both modalities to boost dementia detection.

  • Clinical Applicability: The model’s perfect accuracy on the test set highlights its potential usefulness in clinics, where identifying dementia demands accurate and reliable tools.

  • Impact on Dementia Diagnostics: The Siamese network’s success shows that it can significantly assist dementia diagnostics. Audio and text data together help the model identify and sort patterns, improving how speech data can be used in hospitals and clinics.

Multimodal Siamese Networks for Dementia Detection from Speech in Women prove superior to similar existing models. Using a mix of integration methods, network setup, regularization, and tuning hyperparameters results in a highly accurate dementia detection model. This model could help identify dementia at an early stage in women by analyzing their speech.

Model performance analysis

The study investigates how well a multimodal Siamese network can detect dementia in women using their voices. The Siamese network is built to handle audio and text data, emphasizing its usefulness in recognizing the challenging aspects of dementia.

  • i.

    Training Process

  • The model learned effectively during training, analyzing both audio and text in a nuanced way.

  • Accuracy increased with every epoch, reaching 98.46% at the end of the 50 epochs, indicating that the model learned well from the given data.

  • ii.

    Validation Subset

  • Getting perfect accuracy on the validation subset suggests that the model can apply to data it has not seen before.

  • Having a high validation accuracy is an essential sign that the model applies its training properly to new datasets.

  • iii.

    Performance Visualization

  • Figure 20 below illustrates how the training and validation accuracy/loss perform as we progress through each epoch.

  • As shown on the graph, training and validation accuracy are increasing, and loss is decreasing, which verifies the model’s ability to learn and make correct predictions.

Fig. 20.

Fig. 20

Training and validation accuracy/Loss performances.

  • iv.

    Test Results

  • The accuracy of the results proves that the model can work well in different situations.

  • Because the test loss is 0.0002066, it is evident that the model can identify cases of dementia and healthy people very precisely.

  • Accuracy reached 100%, confirming that the model predicts correctly on new cases it has not seen.

  • v.

    Multimodal Integration

  • The model uses its capability to integrate audio and written data well.

  • The strong results in testing suggest that the model successfully combines information from both sounds and text, which supports its effectiveness in spotting symptoms of dementia.

  • vi.

    Clinical Applicability

  • Reaching 100% accuracy on the test set proves that the model could be beneficial in helping find dementia.

  • Its results imply that the model could be helpful in medical settings since accurate and trustworthy tools are essential.

  • vii.

    Contributions to Dementia Diagnostics

  • Its excellent performance suggests that the multimodal Siamese network positively affects the field of identifying dementia.

  • Processing both audio and text information allows it to check for dementia more fully, which in practice may help improve both accuracy and reliability.

  • viii.

    Why This Model is Exceptional:

  • High Accuracy: The model demonstrated 98.46% accuracy during training and achieved 100% accuracy on both the validation and test subsets (Table 2).

  • The high level of generalization is shown by the model’s excellent accuracy on data that was not included in its training.

  • The model does well because it connects audio and text information, making it useful when information comes in multiple modes.

  • Being an accurate and diverse approach, the model can serve as a helpful diagnostic method in clinical care for people with dementia.

Table 2.

Hyperparameter settings.

Hyper-parameter Value
Batch size 32
Epochs 50
Learning rate 0.01
Dropout 0.1
Embedding units 100

To conclude, the multimodal Siamese network’s strong learning, generalization, and data type support suggest it will help detect dementia in women through their speech.

Model evaluation

To ensure the model performs well and remains reliable in many cases, varied datasets are used when assessing the proposed multimodal Siamese networks for detecting dementia in women, and the assessment goes beyond the standard evaluation factors. The core metrics are accuracy, precision, recall, F1-score, and the ROC curve, and the additional data make the study more valuable.

Evaluation metrics

  • i.

    Accuracy (ACC) measures how effectively the model classifies samples in the dataset. Equation 5 combines true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), giving a complete assessment of how well the model works and how accurately it allocates each sample.

ACC = (TP + TN) / (TP + TN + FP + FN)  (5)

where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.

  • ii.

    Precision (PREC) measures the proportion of predicted positives that are truly positive (Eq. 6), i.e., how well the model avoids false positives among its positive predictions. For medical use, preventing false positives is essential, and weighing precision alongside sensitivity reflects a detailed assessment.

PREC = TP / (TP + FP)  (6)
  • iii.

    Recall (RECALL) measures how many of the actual positive instances the model identifies (Eq. 7). Because identifying cases early is essential for successful intervention, recall tells us much about the model’s effectiveness in real life.

RECALL = TP / (TP + FN)  (7)

Further, Table 3 shows the evaluation metrics of the proposed model.

Table 3.

Evaluation metrics of the proposed model.

Evaluation metric Performance value
AUC 0.99
Accuracy 0.99
Precision 0.99
Recall 0.99
F1 score 0.99
  • iv.

    The F1-Score is the harmonic mean of precision and recall (Eq. 8), a valuable indicator that accounts for both false positives and false negatives. It shows how well the model works in different settings and is imperative when both high precision and high recall are wanted.

F1 = 2 × (PREC × RECALL) / (PREC + RECALL)  (8)
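Equations 5–8 can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical counts (the FP and FN values are illustrative, not from the paper) chosen to land near the reported 0.99 scores.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # Eq. 5
    prec = tp / (tp + fp)                   # Eq. 6
    rec = tp / (tp + fn)                    # Eq. 7
    f1 = 2 * prec * rec / (prec + rec)      # Eq. 8
    return acc, prec, rec, f1

# Hypothetical counts: 289 dementia and 241 control files with a few errors.
acc, prec, rec, f1 = metrics(tp=289, tn=241, fp=3, fn=3)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

Note that when precision equals recall, the F1-score (their harmonic mean) equals that common value.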
  • v.

    The ROC Curve shows how the model performs as the criterion for separating the classes changes. The Area Under the ROC Curve (AUC-ROC) quantifies how well the model distinguishes positive from negative observations, and the true positive rate (TPR) and false positive rate (FPR) show how the model’s sensitivity and specificity trade off at different operating points.

TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
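The ROC curve and its area can be computed by sweeping the decision threshold over the model's scores. A minimal sketch with hypothetical scores and labels:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold and collect (FPR, TPR) pairs."""
    pairs = sorted(zip(scores, labels), reverse=True)
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n, tp / p))
    return pts

def auc(pts):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Hypothetical model scores: dementia cases (label 1) score higher.
scores = [0.95, 0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auc(roc_points(scores, labels)))  # 1.0 for a perfect separation
```

An AUC of 1.0 means every positive outranks every negative; 0.5 corresponds to random scoring.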
  • vi.

    Figure 21 shows the ROC curve of the network, which detects dementia from women’s speech. An AUC approaching 1.0 means the method can almost perfectly tell whether someone has dementia, and it shows that the model performs well despite many challenges. The ROC curve and AUC-ROC results, together with all the other measures, gave near-perfect scores; this convergence across criteria demonstrates that the model categorizes cases properly and sets a new level of diagnostic accuracy. The top scores for accuracy, precision, recall, F1-score, and the ROC curve indicate that the model is ready to help patients: examining speech in multiple ways makes it possible to detect dementia in women, which supports early disease management. Because the model is strong, reliable, and accurate across many areas, it could shape how dementia is diagnosed in actual clinical situations and let medical professionals detect dementia quickly and accurately.

Fig. 21.

Fig. 21

AUC-ROC curve.

In short, multimodal Siamese networks are convincingly stronger and better at recognizing dementia in women through speech than conventional criteria. Various datasets are used to strengthen how the ROC curve, accuracy, precision, recall, and F1-score are technically assessed. By following this complete framework, the study becomes reliable and relevant for use in many practical situations.

Confusion matrix

The confusion matrix measures how well a model such as the proposed Multimodal Siamese Network classifies dementia in women from speech. In Fig. 22, the matrix summarizes the model’s results as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In a task as sensitive as diagnosing dementia, binary classification depends on the confusion matrix to examine the results; it shapes the data to illustrate clearly how the model detects the presence of dementia.

Fig. 22.

Fig. 22

Confusion Matrix Heatmap of Dementia Detection Using Test Data.

  • The true positive (TP) rate is the number of dementia cases correctly identified. It proves the model’s dementia detection accuracy.

  • True Negative (TN): Representing cases correctly classified as negative (non-dementia), TN shows the model’s accuracy in identifying non-dementia cases.

  • False Positive (FP): Cases of Type I error, where non-dementia cases are wrongly identified as positive; FP reflects the model’s false-alarm rate.

  • False Negative (FN): Positive instances misclassified as negative (Type II mistake). FN shows the model’s dementia misidentification errors.
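The four counts above can be tallied directly from label vectors. A minimal sketch with hypothetical labels (1 = dementia, 0 = control):

```python
def confusion(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = dementia, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Toy example: 4 dementia cases, 3 controls, one Type II error.
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0]
print(confusion(y_true, y_pred))  # (3, 3, 0, 1)
```

The resulting 2×2 table is exactly what the heatmap in Fig. 22 visualizes at full dataset scale.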

  • i.

    Unveiling Model Performance using Confusion Matrix

  • The confusion matrix is crucial to understanding multimodal Siamese network performance. Understanding matrix components is vital for dementia detection, where precision is critical.

  • Analysis shows how the confusion matrix is used to evaluate model performance. Figure 22, the Confusion Matrix Heatmap, shows the model’s power: 241 controls are accurately recognized, highlighting the TN cases, and 289 dementia cases are correctly identified, highlighting the TP cases.

  • ii.

    Input Dataset Perspectives: Unpacking the Confusion

  • A confusion matrix is critical to solving this issue. 241 people are correctly recognized as controls, indicating strong TN performance, and 289 are correctly classified as dementia patients; the remaining FN scenarios occur when the model misidentifies dementia patients as controls.

  • This highlights the need for detailed knowledge of the model’s performance across datasets. It promotes a model architecture and training approach review to improve real-world adaptability and efficacy.

  • iii.

    Confusion Matrix in Perspective

  • Multimodal Siamese networks for dementia detection from women’s speech rely on the confusion matrix. Its detailed dissection of TP, TN, FP, and FN reveals the model’s strengths and flaws; beyond success on seen datasets, the matrix reveals performance nuances on unseen data.

  • This investigation shows the requirement for a complete review strategy beyond metrics. The confusion matrix’s granular insights guide model refinement, robustness in varied circumstances, and multimodal dementia detection.

Novel model design

Dementia, a brain-related disease, calls for new methods to identify and care for people early. The study’s newly developed Siamese network fuses multiple data modalities with attention mechanisms to detect indicators of dementia in women from their spoken words. Our main contribution is to use a Siamese network over multiple data types, including text and audio, and to include an attention mechanism in the network’s text-processing part. Because of this, the model focuses on specific aspects of language in the transcript that could indicate typical signs associated with dementia.

Our Siamese design combines features from both the audio and attention-driven text parts. Thanks to this design, the model can use information from both sources, improving its grasp of the content.

  • i.

    Architectural Design.

Figure 23 shows the design of our model and highlights the essential features that improve dementia diagnosis. Combining the methods described below forms a refined and efficient Siamese network for multimodal learning.

Fig. 23.

Fig. 23

Novel Model Design Features.

  • Incorporation of the Attention Mechanism

The text-processing branch of the Siamese network includes an attention mechanism, so the model can assess how important each word in the transcript is. This enables it to focus on critical linguistic features associated with dementia; by weighting details and shifting focus as needed, our approach can detect the early signs of dementia better than before.
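A minimal sketch of this word-level attention idea, softmax-normalized relevance scores weighting the word vectors, is shown below. The scores and embeddings are hypothetical toy values, not learned parameters from the paper's model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(word_vectors, scores):
    """Weight each word vector by its attention score and sum.

    `scores` stands in for learned per-word relevance; words carrying
    linguistic markers of dementia would receive higher scores."""
    weights = softmax(scores)
    dim = len(word_vectors[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, word_vectors))
               for d in range(dim)]
    return context, weights

# Hypothetical 3-word transcript with 4-d embeddings.
vecs = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]
context, weights = attend(vecs, scores=[0.1, 2.0, 0.1])
print(max(weights) == weights[1])  # True: the second word dominates
```

The weighted sum (`context`) is the fixed-size summary of the transcript that the downstream dense layers consume.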

  • Enhanced Feature Extraction

With attention built in, the model can better identify essential details in a text while keeping its meaning in mind. Because of this improvement, our model can pick up details that standard models could miss. The result is a clearer view of the original data, which allows a better understanding of the internal relationship between language and dementia.

  • Integration of Different Modalities

The approach performs excellent cross-modal fusion by uniting features from the audio and textual modes. We use a Multimodal Siamese architecture for this purpose. Using information from different sources allows the model to understand the topic better. By blending data from audio and text, our approach obtains a better understanding and becomes stronger at finding dementia cases.
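One common way to realize such cross-modal fusion is late fusion by concatenation; a toy sketch (the 64-dimensional branch outputs are hypothetical stand-ins for the audio and text encoder features):

```python
def fuse(audio_feats, text_feats):
    """Late fusion by concatenation: one joint vector for the classifier."""
    return audio_feats + text_feats

# Hypothetical branch outputs: 64-d audio features, 64-d text features.
audio = [0.1] * 64
text = [0.2] * 64
joint = fuse(audio, text)
print(len(joint))  # 128, matching the first dense layer of the classifier
```

Concatenation preserves both feature sets intact and lets the dense layers that follow learn the cross-modal interactions.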

  • Improved Dementia Detection

Our main objective is to improve how dementia is recognized by considering both voice recordings and text notes. The model is directed by the attention process to highlight significant parts of the transcript, which could make it more accurate and sensitive at identifying dementia. Due to our advanced approaches, the model stands out as a forerunner in modern healthcare diagnostics.

  • ii.

    Notable Contributions and Advancements

The model outlined here brings many unique and significant ideas that differ from what is usually done in speech-based dementia detection.

  • Generalization Potential

Attention-based architectures allow for quick adaptation to different types of data. They work well in varied situations because they offer strong representations that endure noise and changes in the data. Focusing on the important parts of the transcript makes the model more flexible, enabling it to do well in many other speech-related tasks besides determining whether someone has dementia.

  • Advancement of Multimodal AI

This analysis makes clear that attention over multiple input modalities improves the performance of multimodal AI models. The model demonstrates how various data forms and advanced attention approaches can face challenging issues in the real world. From our research, blending speech audio with its transcribed text advances multimodal AI.

  • Encouraging Findings

Our model performed well in the initial review, with each criterion reaching a near-perfect score. This demonstrates that our new framework may be practical for spotting dementia, and evaluating additional data and actual cases can further confirm its accuracy. Bringing attention mechanisms into the multimodal Siamese network is a significant methodological advancement, and because of the enhanced feature extraction, the accuracy of diagnosing dementia is expected to improve.

We have developed a new way to detect dementia in women using their voices through a multimodal Siamese architecture. Because of attention, fusion, and advanced feature extraction, our model is positioned to be a leader in advancing precise and sensitive diagnostic systems. Our multimodal artificial intelligence research supports the development of healthcare diagnostics and demonstrates the potential of attention-based architectures to handle challenging problems in the field.

Core contributions

By developing a unique detection method that combines recordings and documents, the research sustains the study of dementia. Here are the main points of what the study has contributed:

  • i.

    Novel Multimodal Approach

This paper studies a new multimodal idea using speech transcripts and audio in dementia detection. With this integration, it is possible to see both the speech’s language and sound, which makes the data more complete. The key feature is linking these techniques together, which allows the model to find unusual signals of dementia better. The study benefits from using multimodal methods, which opens doors for better reliability and accuracy in diagnosis.

  • Strengths

  • Comprehensive data representation: Audio and transcripts available can help us better understand the speech characteristics of dementia patients.

  • Enhanced pattern recognition: The model can notice more patterns when using different data sources.

  • Weaknesses

  • Increased complexity: Adding more modalities to the model can make the architecture and training process more complicated.

  • Data pre-processing challenges: Integrating audio and text requires properly pre-processing the data, which may not be simple.

  • ii.

    Siamese Architecture

The study uses a Siamese network to blend audio and text data. Because of this structure, the model is able to identify similar concepts from different data types and combine them meaningfully. By using a Siamese network, the model can learn how audio and text are related, which improves its accuracy in spotting signs of dementia.

  • Strengths

  • Cross-modality representation learning: The Siamese architecture helps the model find representations that are used across multiple types of data.

  • Increased model expressiveness: More complex relationships are represented, allowing the model to be more expressive and accurate.

  • Weaknesses

  • Computational demands: Since Siamese networks use more resources, they might not be able to support large-scale projects as easily as simple models.
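The defining property of the Siamese design described above, a single shared encoder applied to both inputs, can be sketched in a few lines. The 2×2 weight matrix and inputs are hypothetical toy values, not the paper's trained network.

```python
# One shared projection matrix applied to BOTH inputs - the defining
# property of a Siamese network (weights are hypothetical toy values).
SHARED_W = [[0.5, -0.2],
            [0.1, 0.3]]

def encode(x):
    """Project an input through the shared weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in SHARED_W]

def l1_distance(a, b):
    """Element-wise |a - b|, the similarity signal fed to the classifier."""
    return [abs(u - v) for u, v in zip(a, b)]

pair_same = l1_distance(encode([1.0, 2.0]), encode([1.0, 2.0]))
pair_diff = l1_distance(encode([1.0, 2.0]), encode([-1.0, 0.5]))
print(sum(pair_same), sum(pair_same) < sum(pair_diff))  # 0.0 True
```

Because both branches share `SHARED_W`, similar inputs land close together in the learned space and dissimilar ones far apart, which is what the distance-based classifier exploits.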

  • iii.

    Data Augmentation

To address the lack of data, the study applies methods like Data Augmentation to enhance the inputs and improve the model’s accuracy. Changing the training samples helps the model handle new situations and reduces the risk of overfitting.

  • Strengths

  • Improved generalization: Data augmentation addresses overfitting because it exposes the model to many cases.

  • Enhanced dataset breadth: A large variety of data is possible with augmentation, which helps cover numerous potential variations.

  • Weaknesses

  • Quality of augmented data: The success of data augmentation depends on whether the new samples are high-quality, and low-quality samples may add noise.
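Typical augmentations for speech signals, such as noise injection and time shifting, can be sketched as follows. The short waveform and the parameters are hypothetical; the paper does not specify its exact augmentation operations.

```python
import random

def add_noise(signal, noise_std=0.05, seed=0):
    """Additive Gaussian noise: a common audio augmentation."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in signal]

def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples."""
    return signal[-shift:] + signal[:-shift]

# A toy 8-sample waveform standing in for a real recording.
clean = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
augmented = [add_noise(clean), time_shift(clean, 2)]
print(len(augmented[0]) == len(clean), augmented[1][0] == -1.0)
```

Each augmented copy keeps the original label, so the training set grows without new recordings while exposing the model to realistic variation.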

  • iv.

    Comprehensive Evaluation

The study applies various checks (accuracy, precision, recall, F1-score, and ROC analysis) to ensure the model performs consistently. This detailed assessment indicates whether the approach is effective at correctly discovering signs of dementia.

  • Strengths

  • Rigorous evaluation metrics: Several ways of evaluating the model provide accurate information about what it does well and where it may improve.

  • Robustness verification: Thorough testing of the model’s performance increases the certainty that it will perform reliably for several metrics.

  • Weaknesses

  • Dataset bias: The usefulness of the study’s results depends on the quality and true range of the Dementia Bank Database; any data bias could give inaccurate results.

  • v.

    Potential Clinical Implications

The developed Siamese network, which may impact clinical practice, might help detect dementia in women early. Its high accuracy and ability to discriminate among patients make it helpful in locating those who need extra help, acting promptly, and raising the standard of care.

  • Strengths

  • Clinical relevance: The model is clinically useful because it helps identify and address dementia in the early stages.

  • Improvement in patient care: Since timely diagnosis causes better outcomes, this model could significantly affect healthcare settings.

  • Weaknesses

  • Real-world implementation challenges: Applying research in real situations can bring up obstacles such as securing patient data and following regulations.

  • vi.

    Gender-Specific Analysis

The field examines gender-specific aspects since both physical and behavioural characteristics can appear differently in men and women. This method increases our understanding of how gender affects dementia diagnosis, which may, in turn, inspire better diagnostic methods.

  • Strengths

  • Tailored insights: By studying women specifically, researchers can see new details that affect women with dementia.

  • Potential for personalized interventions: Some behavioural traits related to gender can influence how to treat or support a person, so gender-specific knowledge could support using different treatment and support plans.

  • Weaknesses

  • Limited generalizability: Since gender-based findings may not work for the entire population, these should be tested further on a wide range of people.

  • vii.

    Interpretable Representations

It explains that interpretable representations can be obtained from the Siamese model without losing performance. Sticking to interpretable models makes the predictions trustworthy and clear, which is needed for clinical purposes.

  • Strengths

  • Transparent decision-making: The capability to observe and understand model decisions helps clinicians earn their trust and agreement.

  • Efficient computation: The model is efficient, making it work well for applications where speed is essential.

  • Weaknesses

  • Interpretability tradeoff: Tradeoffs may arise between making a model interpretable and maintaining its high performance on complex tasks.

  • viii.

    Contributions to Dementia Research

By using both audio and transcript data, the paper advances how researchers in dementia research find signs of the disease. New deep learning models combined with multiple data types provide better insight into how multimodal data can improve diagnosis.

  • Strengths

  • Advancing the field: The research adds essential information to what others know about detecting dementia.

  • Methodological innovation: Integrating different data modalities with complex algorithms is an innovative approach that could influence work outside dementia research.

  • Weaknesses

  • Generalization challenges: It is essential to prove that the method can be used for various data and populations because its results could vary from one situation to another.

In short, the research paper breaks ground with innovative multimodal strategies, reliance on a Siamese architecture, advanced data augmentation techniques, exhaustive evaluation, potential influence on healthcare for dementia patients, gender-specific exploration, interpretable representations, and improvements to dementia research. The suggested approach shows several advantages, such as an increased ability to detect dementia and applicability in medicine, though it also has downsides, including increased complexity, possible bias in the training data, and interpretability tradeoffs that need attention. Tackling these difficulties and performing further studies to confirm the results will be necessary to turn this research into helpful tools doctors can use to detect dementia in women.

Comparison with baseline

More research is being done today on detecting dementia by studying speech, applying different approaches to improve the accuracy and reliability of detection. Researchers in this field regularly investigate new deep learning frameworks, ensembles, and classical machine learning strategies to advance the state of the art. The Dementia Bank database is significant for these studies because it contains many speech samples from both people with dementia and those without. Measuring how well well-known neural networks identify data from the Dementia Bank provides valuable insights. Table 4 compares the past papers with our model.

Table 4.

Comparison with baseline models.

Reference Approach Accuracy Dataset
7 Bidirectional Encoder Representations from Transformers (BERT) model 89.9% Dementia Bank Database
11 Gated Convolutional Neural Networks (GCNN). 80.8% Dementia Bank Database
14 Ensemble Technique 86.5% Dementia Bank Database
17 Stacked Deep Dense Neural Network. 93.31% Dementia Bank Database
20 SVM 77% Dementia Bank Database
24 BERT Model 90.36% Dementia Bank Database
28 PRCNN (Parallel Recurrent CNN), RF, Random Tree, REP Tree, SVM, and ANN 85%, 87.6%, 77.4%, 82.6%, 80.3%, 81.5% Dementia Bank Database
Our Approach Multimodal Siamese Networks 99% Dementia Bank Database

BERT, considered a top model, was reported to have an accuracy of 89.9% in [7]. BERT, widely known for understanding language well, also demonstrated a good ability to discover the main features linked to recognizing dementia. The researchers achieved an accuracy of 80.8% when they applied the Gated Convolutional Neural Networks (GCNN) described in [11]. GCNN performed reasonably well but had lower accuracy than BERT; transformers can produce better contextualized embeddings, which may boost their performance in particular conditions.

An ensemble method described in [14] combined many learners to achieve an accuracy of 86.5%. In most cases, using multiple models reduces overfitting and improves generalization, which helps identify dementia more reliably. The stacked deep dense neural network, mentioned in [17], had the best performance among the compared baselines, with an accuracy of 93.31%; deep learning allows these models to spot intricate patterns in the Dementia Bank dataset.

The study also included traditional methods such as the Support Vector Machine (SVM) model, which reached an accuracy of 77% [20]. SVMs have been widely used across industries; their lower ability to discover dementia suggests that the Dementia Bank data might require more advanced models to get the best results.

A further investigation revisited BERT in [24] and improved its accuracy to 90.36%, confirming how adaptable transformer-based structures are when trained specifically for dementia detection. In [28], the PRCNN model achieved an accuracy of 85%, a good result compared to other models; RF, Random Tree, REP Tree, SVM, and ANN were also evaluated in the same study. Relying on trees or hyperplanes, these traditional models struggled to handle noisy data and complex language features. In contrast, we rely on attention models, feature embedding, and Siamese networks to detect complex patterns, which significantly improves the system’s performance.

Our approach outperformed these baselines, with our multimodal Siamese networks achieving 99% accuracy. The fine-tuned model is now a top pick for spotting dementia in recorded speech from the Dementia Bank. The combination of Siamese networks, which capture how pairs of inputs relate to one another, and a multimodal approach allowed the system to understand the more complicated aspects of the dataset more clearly.

Furthermore, our network’s ability to analyze several data types helps us identify connections between speech and related features. The Siamese architecture is built to discover discriminative elements and subtle links. For dementia, catching early speech cues is essential, because they can reveal the beginning of changes in the brain. The more modalities incorporated, the easier it becomes for the model to access and blend information from many sources, which may increase its stability and usefulness in various situations.

In the context of this research, a new way to diagnose dementia is outlined using both transcripts and audio recordings. When we combine what is heard in the audio with text that has been transcribed, we improve the model’s ability to spot minimal signs of dementia that only speech could reveal. To blend audio and text well, we settled on Siamese architecture. Because of this arrangement, the model can find similar features in different data types, allowing different data forms to support and enrich each other. A complete assessment demonstrates the solid performance of the model. The model’s accuracy, sensitivity, and value in finding dementia are shown by checking its accuracy, precision, recall, F1-score, and the ROC curve.

Overall, the comparison finds that dementia detection algorithms created using Dementia Bank reveal detailed and complex aspects. Our multimodal Siamese networks reached an accuracy of 99%, better than the other models. Joining deep learning, multimodal fusion, and Siamese architectures improves accuracy in spotting the speech patterns of individuals with dementia. As the area develops, attention must be given not only to model accuracy but also to making models understandable and deployable.

Discussion

This study mainly seeks to build and test a network that can utilize speech to spot signs of dementia in women. This approach, using deep learning and analyzing several types of medical information, is meant to improve dementia diagnosis. It is especially interesting to use a Siamese network because it helps the model find the relationships between the two modes—audio and text. It teaches how to recognize different ways of speaking caused by dementia in women.

In addition, using speech data from the Pitt Corpus in the Dementia Bank database helps to maintain a reliable study outcome. The Dementia Bank is a known collection of interactions in different formats chosen for study in dementia research and communication. The research certifies that the model is trained and examined using speech examples commonly linked to dementia. Good data management increases the credibility and usefulness of the recommended MMSN.

This paper also points out that multimodal Siamese networks are essential for analyzing and understanding data in deep learning. Voice-based dementia detection with MMSN demonstrates how machine learning is advancing in healthcare. Because the model can look at both audio and text at the same time, it helps understand the detailed forms of dementia in speech, and this method makes it useful for early detection and management.

This investigation also emphasizes a detailed examination of the planned MMSN. A study uses multiple datasets and surpasses basic assessment standards to determine how well and dependably the model works. Assessment measures such as AUC, accuracy, precision, recall, and F1 score keep giving top-notch results, all getting a value of 0.99. The great outcomes achieved show that MMSN works well, making it one of the best models for dementia detection in women using speech analysis.

The comparative analysis also pits the MMSN against several other approaches: Bidirectional Encoder Representations from Transformers (BERT), Gated Convolutional Neural Networks (GCNN), an ensemble technique, a stacked deep dense neural network, Support Vector Machines (SVM), and a Parallel Recurrent CNN (PRCNN). All models are evaluated on the DementiaBank database, which keeps the comparison honest and fair. The MMSN attains an excellent accuracy of 99%, higher than the alternatives. This high accuracy establishes the MMSN as an effective and dependable way to detect dementia, surpassing previous techniques.

The implications of this research extend beyond the specialty of diagnosing dementia. It shows that approaches of this kind allow artificial intelligence to perform well in healthcare applications. The combined use of audio and text data with attention-based techniques achieves strong outcomes. The MMSN still requires further validation, which should prompt researchers to explore similar multimodal methods in other branches of medicine. In short, the research plays a significant role in leveraging AI for early and accurate detection of illness, with dementia in women as its key focus.

Conclusion

We use multimodal Siamese networks on speech to advance dementia detection, with particular attention to women. Because dementia is highly complex, traditional ways of detecting it are often ineffective and need to be replaced. Our experiments show that these advanced networks can deliver accurate and rapid diagnoses.

The outstanding performance our model achieved shows that it can be trusted. The model is robust in finding signs of dementia in voice data, with Area Under the Curve (AUC), accuracy, precision, recall, and F1 score all reaching 0.99. These results demonstrate the model's ability to recognize true dementia cases and show how it can reduce the misclassification of patients. In medicine, this matters greatly, since an incorrect diagnosis can severely affect the patient.

Comparative studies against well-known models show that our approach performs best, reaching an accuracy of 99%. Although widely used, support vector machines (SVM) and ensemble methods fall behind newer techniques for dementia detection. The performance studies also show that multimodal Siamese networks outperform BERT and PRCNN, indicating that the Siamese architecture, combined with multiple data types, makes it easier to capture complex patterns in the data.

A stacked deep dense neural network also performs very well. By comparison, however, multimodal Siamese networks are more practical and easier to interpret, which explains their usefulness in medical settings. Being able to understand and explain a model's judgments is essential for diagnosis. Because Siamese networks operate relatively transparently, it is easier to see which variables drive the detection of dementia. This transparency builds trust among healthcare professionals and allows our methods to be adopted smoothly.

Furthermore, addressing gender differences in dementia research is a central focus: we examine specifically how women speak. Such research is needed because dementia often manifests differently in men and women. By studying these differences, we can develop tailored strategies for diagnosing and treating patients. This study therefore supports more suitable and effective management strategies for people with dementia by taking gender-based information into account.

Our conclusions reach beyond the immediate task of identifying dementia. Because multimodal Siamese networks are effective at discerning complex voice patterns, their utility in other fields is evident. As the technology advances, these networks may one day help detect other cognitive illnesses, improving healthcare overall.

Besides providing a new way to detect dementia, our study highlights the need to study these diseases by gender in order to better understand and care for people with neurodegenerative conditions. With our multimodal Siamese networks, we see significant progress toward fast and reliable medical findings. This offers hope for people at risk of dementia and strengthens the case that technology will significantly impact healthcare.

Prospective endeavors

Our research has advanced dementia detection, though many areas remain to be explored and improved:

  • Including more diverse individuals in the dataset improves the model’s generalization. If demographic, ethnic, and geographic variation is incorporated, the model can represent a larger part of the population, and its robustness will increase.

  • Taking multiple measurements of the same people over time, as part of a longitudinal study, gives a better picture of how dementia develops. This would support models that both detect and predict dementia, enabling personalized and timely treatment.

  • While we have mainly worked with speech data, imaging (e.g., MRI or CT scans) and behavioural data can provide a broader understanding of the condition. Combining different approaches can improve the diagnostic model’s effectiveness and reliability.

  • Interpreting the model’s results helps make its judgments more trustworthy in a medical setting. Developing methods for understanding the characteristics learned by multimodal Siamese networks is important, since they reveal what the network treats as decisive for a diagnosis.

  • Real-world Deployment: It takes many steps to move a model out of research into real-world medical care, since issues like ethics, required regulations, and fitting into existing medical systems arise. More research is necessary to improve this transition so that the model will be helpful and accessible to healthcare workers.

  • Involvement with medical experts helps improve the model by including clinicians’ insights and knowledge. Clinician feedback can make the model more effective for specific diagnostic problems and match it with what healthcare workers need.

Multimodal Siamese networks proved highly successful in identifying dementia from speech data, most notably for female subjects. Compared with older models, the excellent results indicate that the technique could support early and precise diagnosis. Work continues, however, toward a thorough process for detecting dementia. The plan above describes ways to enhance the model’s capabilities and ensure its use in healthcare environments. Collaboration among researchers, clinicians, and technologists will make a real difference in how dementia is diagnosed and treated in the future.

Acknowledgements

This work was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University, through the Research Groups Program Grant no. (RGP-1444-0057).

Abbreviations

DL

Deep learning

DDNN

Deep dense neural network

CNN

Convolutional neural network

MMSN

Multimodal Siamese network

KNN

K-Nearest neighbour

ADReSS

Alzheimer’s dementia recognition via spontaneous speech

HC

Healthy control

eGeMAPS

Extended Geneva minimalistic acoustic parameter set

MLP

Multi-layer perceptron

MRCG

Multi-resolution cochleagram features

STFT

Short-time Fourier transform

RMSE

Root mean square error

AUC

Area under the curve

SDDNN

Stacked deep dense neural networks

AI

Artificial intelligence

ML

Machine learning

SVM

Support vector machine

LSTM

Long short-term memory

RF

Random forest

MMSE

Mini-mental state exam

GCNN

Gated convolutional neural networks

AD

Alzheimer’s disease

ADR

Active data representation

PRCNN

Parallel recurrent CNN

EDA

Exploratory data analysis

MFCC

Mel-frequency cepstral coefficients

ReLU

Rectified linear unit

ROC

Receiver operating characteristic

BiLSTM

Bidirectional long short-term memory

Author contributions

The authors confirm contribution to the paper as follows: study conception and design: W. Ahlem, K. Amel; data collection: Z. Mohammed; analysis and interpretation of results: A. Manel, A. Tagrid, A. A. Nouf; draft manuscript preparation: Z. Mohammed, W. Ahlem, K. Amel. All authors reviewed the results and approved the final version of the manuscript.

Funding

This work was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University, through the Research Groups Program Grant no. (RGP-1444-0057).

Data availability

The dataset is available from the DementiaBank English Pitt Corpus: https://dementia.talkbank.org.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Ahlem Walha, Mohammed Zakariah, Manel Ayadi, Tagrid Alshalali and Nouf Abdullah Almujally have contributed equally to this work.

References

  • 1.Shah, Z. et al. Learning Language and Acoustic Models for Identifying Alzheimer’s Dementia From Speech. Front. Comput. Sci.10.3389/fcomp.2021.624659 (2021). [Google Scholar]
  • 2.Sarawgi, U., Zulfikar, W., Soliman, N. & Maes, P. Multimodal inductive transfer learning for detection of alzheimer’s dementia and its severity. In Interspeech 2020 2212–2216 (ISCA, 2020). 10.21437/Interspeech.2020-3137. [Google Scholar]
  • 3.Wallensten, J. et al. Stress, depression, and risk of dementia – a cohort study in the total population between 18 and 65 years old in Region Stockholm. Alzheimers Res. Ther.15(1), 161. 10.1186/s13195-023-01308-4 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Spiers, H. J., Coutrot, A. & Hornberger, M. Explaining World-Wide Variation in Navigation Ability from Millions of People: Citizen Science Project Sea Hero Quest. Top. Cogn. Sci.15(1), 120–138. 10.1111/tops.12590 (2023). [DOI] [PubMed] [Google Scholar]
  • 5.Bhattacharya, S., Heidler, P. & Varshney, S. Incorporating neglected non-communicable diseases into the national health program—A review. Front. Public. Health10.3389/fpubh.2022.1093170 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Fucci, D. et al. Oct., A longitudinal cohort study on the retainment of test-driven development, in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, New York, NY, USA: ACM, pp. 1–10. 10.1145/3239235.3240502 (2018).
  • 7.Han, H. J., Qiu, S. B. N. L. & Abdullah, S. Automatic classification of dementia Using text and speech data, 399–407. 10.1007/978-3-031-14771-5_29 (2023).
  • 8.Rubio-Martín, S., García-Ordás, M. T., Bayón-Gutiérrez, M., Prieto-Fernández, N. & Benítez-Andrades, J. A. Early Detection of Autism Spectrum Disorder through AI-Powered Analysis of Social Media Texts, in 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, pp. 235–240. 10.1109/CBMS58004.2023.00223 (2023).
  • 9.Martinc, M. & Pollak, S. Tackling the adress challenge: A multimodal approach to the automated recognition of alzheimer’s dementia. In Interspeech 2020 2157–2161 (ISCA, 2020). 10.21437/Interspeech.2020-2202. [Google Scholar]
  • 10.Ilias, L., Askounis, D. & Psarras, J. A Multimodal Approach for Dementia Detection from Spontaneous Speech with Tensor Fusion Layer, in IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), IEEE, Sep. 2022, pp. 1–5. 10.1109/BHI56158.2022.9926818 (2022).
  • 11.Rodrigues Makiuchi, M. et al. Speech paralinguistic approach for detecting dementia using gated convolutional neural network. IEICE Trans. Inf. Syst.E104.D(11), 2020EDP7196. 10.1587/transinf.2020EDP7196 (2021).
  • 12.Meghanani, A., Anoop, C. S. & Ramakrishnan, A. G. Recognition of Alzheimer’s Dementia From the Transcriptions of Spontaneous Speech Using fastText and CNN Models. Front. Comput. Sci.10.3389/fcomp.2021.624558 (2021). [Google Scholar]
  • 13.Shi, M., Cheung, G. & Shahamiri, S. R. Speech and language processing with deep learning for dementia diagnosis: A systematic review. Psychiatry Res.329, 115538. 10.1016/j.psychres.2023.115538 (2023). [DOI] [PubMed] [Google Scholar]
  • 14.Ilias, L. & Askounis, D. Multimodal Deep Learning Models for Detecting Dementia From Speech and Transcripts. Front. Aging Neurosci.10.3389/fnagi.2022.830943 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Zhou, Q., Wang, J., Yu, X., Wang, S. & Zhang, Y. A Survey of Deep Learning for Alzheimer’s Disease. Mach. Learn Knowl. Extr.10.3390/make5020035 (2023). [Google Scholar]
  • 16.Zheng, C. et al. Detecting Dementia from Face-Related Features with Automated Computational Methods. Bioengineering10(7), 862. 10.3390/bioengineering10070862 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Khan, Y. F., Kaushik, B., Rahmani, M. K. I. & Ahmed, M. E. Stacked deep dense neural network model to predict alzheimer’s dementia using audio transcript data. IEEE Access.10, 32750–32765. 10.1109/ACCESS.2022.3161749 (2022). [Google Scholar]
  • 18.Campbell, E. L., Docio-Fernandez, L., Jiménez-Raboso, J. & Gacia-Mateo, C. Alzheimer’s dementia detection from audio and Language modalities in spontaneous speech. In IberSPEECH 2021 270–274 (ISCA, 2021). 10.21437/IberSPEECH.2021-57. [Google Scholar]
  • 19.Edwards, E., Dognin, C., Bollepalli, B. & Singh, M. Multiscale system for alzheimer’s dementia recognition through spontaneous speech. In Interspeech 2020 2197–2201 (ISCA, 2020). 10.21437/Interspeech.2020-2781. [Google Scholar]
  • 20.Santander-Cruz, Y., Salazar-Colores, S., Paredes-García, W. J., Guendulain-Arenas, H. & Tovar-Arriaga, S. Semantic Feature Extraction Using SBERT for Dementia Detection. Brain Sci.12(2), 270. 10.3390/brainsci12020270 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Meghanani, A. & Ramakrishnan, A. G. An Exploration of Log-Mel Spectrogram and MFCC Features for Alzheimer’s Dementia Recognition from Spontaneous Speech. In 2021 IEEE Spoken Language Technology Workshop (SLT) 670–677 (IEEE, 2021). 10.1109/SLT48900.2021.9383491. [Google Scholar]
  • 22.Inoue, Y., Shue, F., Bu, G. & Kanekiyo, T. Pathophysiology and probable etiology of cerebral small vessel disease in vascular dementia and Alzheimer’s disease. Mol. Neurodegener18(1), 46. 10.1186/s13024-023-00640-5 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Cutuli, D., Decandia, D., Giacovazzo, G. & Coccurello, R. Physical Exercise as Disease-Modifying Alternative against Alzheimer’s Disease: A Gut–Muscle–Brain Partnership. Int. J. Mol. Sci.24(19), 14686. 10.3390/ijms241914686 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Ortiz-Perez, D. et al. A Deep Learning-Based Multimodal Architecture to predict Signs of Dementia. Neurocomputing548, 126413. 10.1016/j.neucom.2023.126413 (2023). [Google Scholar]
  • 25.Syed, M. S. S., Syed, Z. S., Lech, M. & Pirogova, E. Automated screening for alzheimer’s dementia through spontaneous speech. In Interspeech 2020 2222–2226 (ISCA, 2020). 10.21437/Interspeech.2020-3158. [Google Scholar]
  • 26.Liu, L., Zhao, S., Chen, H. & Wang, A. A new machine learning method for identifying Alzheimer’s disease. Simul. Model. Pract. Theory99, 102023. 10.1016/j.simpat.2019.102023 (2020). [Google Scholar]
  • 27.Liu, Y. L., Feng, R., Yuan, J. H. & Ling, Z. H. Clever Hans effect found in automatic detection of Alzheimer’s disease through speech. In Interspeech 2024 2435–2439. 10.21437/interspeech.2024-1018 (2024). [Google Scholar]
  • 28.Kumar, M. R. et al. Dementia detection from speech using machine learning and deep learning architectures. Sensors22(23), 9311. 10.3390/s22239311 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Nasreen, S., Rohanian, M., Hough, J. & Purver, M. Alzheimer’s dementia recognition from spontaneous speech using disfluency and interactional features. Front. Comput. Sci.10.3389/fcomp.2021.640669 (2021). [Google Scholar]
  • 30.Mahajan, P. & Baths, V. Acoustic and Language Based Deep Learning Approaches for Alzheimer’s Dementia Detection From Spontaneous Speech. Front. Aging Neurosci.10.3389/fnagi.2021.623607 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Yang, Q., Li, X., Ding, X., Xu, F. & Ling, Z. Deep learning-based speech analysis for Alzheimer’s disease detection: a literature review. Alzheimers Res. Ther.14(1), 186. 10.1186/s13195-022-01131-3 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Haider, F., de la Fuente, S. & Luz, S. An assessment of paralinguistic acoustic features for detection of alzheimer’s dementia in spontaneous speech. IEEE J. Sel.Top Signal Process14(2), 272–281. 10.1109/JSTSP.2019.2955022 (2020). [Google Scholar]
  • 33.Zhao, C. et al. TMTNet: A Transformer-Based Multimodality Information Transfer Network for Hyperspectral Object Tracking. Remote Sens. (Basel)15(4), 1107. 10.3390/rs15041107 (2023). [Google Scholar]
  • 34.DementiaBank English Pitt Corpus. Dementia.talkbank.org.
  • 35.Alqahtani, N. et al. Deep Belief Networks (DBN) with IoT-Based Alzheimer’s Disease Detection and Classification. Appl. Sci.10.3390/app13137833 (2023). [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The dataset is available from the DementiaBank English Pitt Corpus: https://dementia.talkbank.org.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
