Abstract
The growth and aging of the world population have driven a shortage of medical resources in recent years, especially during the COVID-19 pandemic. Fortunately, the rapid development of robotics and artificial intelligence technologies helps the healthcare field adapt to these challenges. Among them, intelligent speech technology (IST) has served doctors and patients, improving the efficiency of medical work and alleviating the medical burden. However, problems such as noise interference in complex medical scenarios and pronunciation differences between patients and healthy people hamper the broad application of IST in hospitals. In recent years, technologies such as machine learning have developed rapidly in intelligent speech recognition, which is expected to solve these problems. This paper first introduces IST's procedure and system architecture and analyzes its application in medical scenarios. Secondly, we review existing IST applications in smart hospitals in detail, including electronic medical documentation, disease diagnosis and evaluation, and human-medical equipment interaction. In addition, we elaborate on an application case of IST in the early recognition, diagnosis, rehabilitation training, evaluation, and daily care of stroke patients. Finally, we discuss IST's limitations, challenges, and future directions in the medical field. Furthermore, we propose a novel medical voice analysis system architecture that employs active hardware, active software, and human-computer interaction to realize intelligent and evolvable speech recognition. This comprehensive review and the proposed architecture offer directions for future studies on IST and its applications in smart hospitals.
Keywords: Automatic speech recognition, Smart hospital, Machine learning, Transcription, Diagnosis, Human-computer interaction
1. Introduction
The average human lifespan is increasing with the improvement of living standards and medical technology, leading to a rapidly aging population. The proportion of the world's population aged 60 and over is expected to reach 22% by 2050 [1], which poses numerous challenges to the healthcare system [2]. The aggravation of aging has increased healthcare costs and caused shortages of human and material resources. In addition, the unbalanced distribution of medical resources worldwide and the lack of advanced medical technology and equipment in underdeveloped areas mean that some sudden-onset diseases cannot be treated in a timely and effective manner [3]. Moreover, some early symptoms are often imperceptible, which aggravates the disease and delays optimal treatment.
With the development of robotics and artificial intelligence (AI) technologies, machines can achieve more efficient and accurate disease diagnosis and assessment in some cases and can assist patients in daily life in place of nurses, which alleviates the problem of insufficient medical resources. For example, intelligent image processing methods based on deep learning (DL) have been applied to X-ray, CT, ultrasound, and facial images for tasks such as COVID-19 detection [[4], [5], [6]], paralysis assessment [7,8], and autism screening [9]. In addition, intelligent speech technology (IST) plays a critical role in smart hospitals because language is the most natural means of communication between doctors and patients and contains much information, such as patients' identity, age, emotion, and even symptoms of diseases [10].
IST refers to the use of machine learning (ML) methods to process human vocal signals to obtain information and realize human-machine communication. In recent years, speech signal research has developed rapidly alongside ML. IST contains many research areas, such as Automatic Speech Recognition (ASR) [11], Voiceprint Recognition, and Speech Synthesis. After years of development, IST has made significant progress and has gradually been applied in daily life. For example, Apple's Siri, the speech-based search services of Google and Baidu, and smart speakers [12] have all entered people's lives and provide convenience.
There are many review articles on speech technologies in medical applications, such as medical reporting [13], clinical documentation [14], speech impairment assessment [15], speech therapy [16], and healthcare [10,17]. However, a review of state-of-the-art IST applications in smart hospitals is still needed. The smart hospital is the key to significantly improving the efficiency of medical work, alleviating the medical burden, and strengthening the robustness of the medical system in response to public health events such as the COVID-19 pandemic. Therefore, the application of IST in smart hospitals and smart healthcare needs to be reviewed to support further development.
As shown in Fig. 1 , in addition to applying it in daily life, IST is a crucial part of smart hospitals to process vocal signals produced by healthy people and patients. It is gradually applied in medical and rehabilitation scenarios [17,18]. For example, IST can be used as a transcription tool to help doctors to record patient information such as personal information and chief complaints. It can also interactively guide patients to seek medical services. Moreover, IST can be an auxiliary tool for doctors to diagnose diseases preliminarily. At the same time, speech can identify patients’ emotional states to help doctors communicate better with them. Furthermore, IST combined with robotics, Internet of Things (IoT) technology, and 5G communication technology can support identifying and monitoring early symptoms of diseases, healthcare for the elderly, telemedicine, etc.
This paper mainly introduces the latest research progress and applications of IST in the healthcare field, summarizes and analyzes the existing research from the perspective of technical realization, and proposes the current challenges and future development directions. The rest of the paper is organized as follows. Section 2 gives the search methodology. Section 3 introduces the typical flow of intelligent speech signal processing, the system architecture of ASR, and an overview of the IST in applications of medical scenarios. The applications of IST in electronic medical documentation, disease diagnosis and evaluation, and human-medical equipment interaction are reviewed in Sections 4, 5, 6, respectively. Section 7 presents a case study of IST in stroke patients’ early recognition, diagnosis, rehabilitation training, evaluation, and daily care. The limitations of current speech technologies in the applications of smart hospitals and future directions are proposed in Section 8. Finally, we conclude this work in Section 9.
2. Search methodology
We performed the literature search on Web of Science and ProQuest. The search covered all available English-language articles published in peer-reviewed journals up to July 2022 to ensure the quality of this review. Moreover, to target only papers related to IST and healthcare, the following keyword combination was searched, limited to the title and abstract: (hospital OR medical) AND (intelligent OR smart OR technology) AND (speech). Only review articles and research articles were included.
Fig. 2 illustrates the article selection process. The initial search returned 3389 articles. After screening the titles and abstracts to remove duplicates, non-English articles, and articles irrelevant to healthcare, 227 articles were retained. Then, 187 articles were retained after screening the full text and excluding studies irrelevant to IST, transcription, disease diagnosis, and human-medical equipment interaction. Finally, we included 173 articles after removing the less relevant articles and dataset papers. The 173 articles are classified by year of publication, as shown in Fig. 3. We also included 28 articles about the methods and algorithms of IST. Furthermore, 15 web pages of medical equipment using IST were also included.
3. Overview of intelligent speech technologies
Speech technology generally includes collecting, coding, transmitting, and processing speech signals. However, the speech signals of doctors and patients collected in the hospital's public areas contain background noise. Moreover, some patients cannot speak or pronounce clearly due to illness or dialect. These issues bring challenges to the acquisition and processing of speech signals. To address noise interference, the acquisition equipment can be upgraded, for example with a microphone array, to suppress noise and acquire speech signals directionally [19,20]. In addition to noise suppression and the collection of high-quality speech signals, current research mainly focuses on processing the signals with state-of-the-art AI algorithms.
As shown in Fig. 4 , speech signal processing mainly includes pre-processing, feature extraction, and recognition [21]. Among them, feature extraction and recognition are the critical steps of IST. Currently, the latest AI technologies are mainly used to improve the performance of feature extraction and recognition. Therefore, without loss of generality, this section first introduces the general flow of intelligent speech processing, presents the architecture of an ASR system, and then summarizes the application of IST in the medical field.
3.1. Procedure of intelligent speech processing
3.1.1. Pre-processing
The pre-processing of speech signals is the first step in IST. The speech signals are generally real-time audio streams and time sequences. They may contain many invalid and silent segments that need to be segmented and filtered out by a voice activity detection algorithm, so that only the valid speech segments are retained for subsequent processing [22]. The valid segments are then usually processed by pre-emphasis, framing, and windowing.
To improve the high-frequency resolution of the speech signals, they are usually pre-emphasized using a first-order Finite Impulse Response high-pass digital filter [23]. Although speech signals are time-varying, they have short-term stationarity and can be treated as steady-state signals over short intervals because the movements of the articulatory muscles during speaking are slow. Therefore, the speech signals are divided into many short-term frames of equal length before processing. Overlaps between adjacent frames are set during framing to ensure the short-term reliability of the speech features and to avoid feature mutation between adjacent frames.
Windowing is usually performed on each frame to reduce the error, caused by truncating the signal, between the framed segments and the original signal. The commonly used window functions include the Rectangular, Hanning, and Hamming windows [24]. By processing each frame with these window functions, which have low-pass characteristics, we obtain the speech signal required for feature extraction.
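To make the pre-processing steps above concrete, the following minimal Python sketch (using NumPy only; the frame length, frame shift, and pre-emphasis coefficient are illustrative values, not prescribed by the cited works) applies pre-emphasis, splits a signal into overlapping frames, and applies a Hamming window.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop_len=160, alpha=0.97):
    """Pre-emphasis, framing with overlap, and Hamming windowing.

    frame_len=400 and hop_len=160 correspond to 25 ms frames with a
    10 ms shift at a 16 kHz sampling rate (illustrative values).
    """
    # First-order FIR high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Split into overlapping short-term frames of equal length
    num_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(num_frames)])

    # Apply a Hamming window to each frame to reduce truncation error
    return frames * np.hamming(frame_len)

# Example with a synthetic 1-second, 16 kHz signal
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.randn(fs)
windowed_frames = preprocess(x)
print(windowed_frames.shape)  # (num_frames, 400)
```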
3.1.2. Feature extraction
The second step of IST is feature extraction, which is also crucial in determining the performance of the intelligent voice processing system. The feature extraction of speech signals aims to convert them into time-varying feature vector sequences through feature value extraction algorithms. The features of speech signals include time domain features, frequency domain features, and other transform domain features [26].
a) Time domain features
The common time domain features of speech signals include short-term amplitude, short-term energy, pitch period, pitch frequency, pitch, and zero-crossing rate. The short-term amplitude M(i) is:
(1) $M(i)=\sum_{n=1}^{L}\left|y_i(n)\right|,\quad i=1,2,\ldots,N$
The short-term energy E(i) is:
(2) $E(i)=\sum_{n=1}^{L}y_i^{2}(n),\quad i=1,2,\ldots,N$
where $y_i(n)$ refers to the amplitude of the n-th sample in the i-th frame of the speech signal, N is the total number of frames after framing, and L is the frame length. M(i) and E(i) are mainly used to distinguish the unvoiced and voiced segments in speech pronunciation. The difference between M(i) and E(i) is that the former has smaller fluctuations than the latter.
The pitch period is the vibration period of the vocal folds when a person makes a sound and is the reciprocal of the fundamental frequency $F_0$ [48], which can be estimated from the speech signal using pitch detection algorithms. Pitch represents the perceived level of the sound frequency and can be expressed in terms of $F_0$ as
(3) $\text{Pitch (mel)} = 2595\,\lg\!\left(1+\dfrac{F_0}{700}\right)$
The zero-crossing rate Z(i) refers to the number of changes of the sign of the sampled value in each frame of the speech signal:
(4) $Z(i)=\dfrac{1}{2}\sum_{n=1}^{L-1}\left|\operatorname{sgn}\!\left[y_i(n+1)\right]-\operatorname{sgn}\!\left[y_i(n)\right]\right|$
where the sign function sgn[x] is:
(5) $\operatorname{sgn}[x]=\begin{cases}1, & x\geq 0\\ -1, & x<0\end{cases}$
Z(i) is also used to distinguish between unvoiced and voiced segments and is often combined with E(i) for endpoint detection, that is, separating the non-speech and speech segments. Z(i) is more effective than E(i) when there is considerable background noise.
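As a minimal illustration of Eqs. (2) and (4), the sketch below computes the short-term energy and zero-crossing rate for each windowed frame produced by the pre-processing sketch above; the simple energy-threshold endpoint detection is only an assumed heuristic, not a method from the cited references.

```python
import numpy as np

def short_term_energy(frames):
    # E(i) = sum_n y_i(n)^2 for each frame i
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    # Z(i) = 0.5 * sum_n |sgn(y_i(n+1)) - sgn(y_i(n))|
    signs = np.where(frames >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

def endpoint_detection(frames, energy_ratio=0.1):
    """Rough per-frame speech/non-speech decision (assumed heuristic):
    a frame is 'speech' if its energy exceeds a fraction of the maximum."""
    energy = short_term_energy(frames)
    return energy > energy_ratio * energy.max()

# frames = preprocess(x)  # from the previous sketch
# print(short_term_energy(frames)[:5], zero_crossing_rate(frames)[:5])
```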
b) Frequency domain features
The spectrum of the speech signal can be obtained by converting each frame of a time-domain speech signal to the frequency domain using the Fast Fourier Transform (FFT). The spectrum contains the frequency and amplitude information of the speech signal. The spectrum can only show the feature of one frame of the speech signal. Therefore, we can combine the spectrum of all speech frames to form a spectrogram to observe the frequency domain features of the whole speech signal. The spectrogram contains three kinds of information: frequency, time, and energy.
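The spectrogram described above can be obtained by applying the FFT to every windowed frame and stacking the magnitude spectra over time, as in this minimal NumPy sketch (the log scaling is an assumed, common choice for visualization):

```python
import numpy as np

def spectrogram(frames):
    """Magnitude spectrogram: one FFT per windowed frame, stacked over time."""
    spectra = np.abs(np.fft.rfft(frames, axis=1))   # per-frame frequency amplitudes
    return 20 * np.log10(spectra + 1e-10)           # log scale (dB) for display

# spec = spectrogram(preprocess(x))   # shape: (num_frames, frame_len // 2 + 1)
```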
c) Other transform domain features
In addition to the commonly used time-domain and frequency-domain parameters, researchers also use characteristic parameters in other transform domains to improve recognition performance. Such transform-domain parameters can reflect the characteristics of the human vocal and auditory organs and therefore have a significant effect on speech signal recognition. The transform-domain features commonly used for speech signals include Mel Frequency Cepstral Coefficients (MFCC) [49,50], Discrete Wavelet Transform (DWT), Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients, Perceptual Linear Prediction [51], and Line Spectral Frequency [26].
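As one example of a transform-domain feature, MFCCs can be extracted with an off-the-shelf library; the sketch below uses librosa with illustrative parameter values (13 coefficients, 25 ms frames with a 10 ms shift), which are not prescribed by the cited studies.

```python
import numpy as np
import librosa

# Synthetic 1-second signal at 16 kHz as a stand-in for recorded speech
fs = 16000
y = np.sin(2 * np.pi * 220 * np.arange(fs) / fs).astype(np.float32)

# 13 MFCCs per frame; n_fft/hop_length correspond to 25 ms / 10 ms frames
mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```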
The above are common feature extraction methods in IST. However, in specific scenarios, we need to adjust the feature extraction method according to the type and characteristics of the collected signals and the performance of the speech recognition system. The extracted speech features are the input of speech recognition.
3.1.3. Recognition
Recognition based on the digital features of the speech signals is the final step in intelligent speech processing. There are many recognition algorithms. For example, Dynamic Time Warping (DTW) is a method for calculating the similarity of two temporal sequences: the similarity between a speech sample and a standard speech signal is obtained by comparing their feature sequences [52]. As shown in Fig. 5, DTW borrows the idea of dynamic programming; the minimum cumulative distance D(i, j) between times i and j of the two sequences is
(6) $D(i,j)=\mathrm{Dist}(i,j)+\min\left\{D(i-1,j),\,D(i-1,j-1),\,D(i,j-1)\right\}$
where Dist(i, j) is the local distance between the two speech signals at times i and j, generally the Euclidean distance. DTW requires little data, needs no pre-training, is easy to implement and apply, and plays a vital role in small-sample scenarios.
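A compact implementation of Eq. (6) is shown below; it computes the DTW distance between two feature sequences (for example, per-frame MFCC vectors), using the Euclidean distance as the local cost.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW distance between two sequences of feature vectors (shape: T x D)."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # Euclidean local cost
            # D(i, j) = Dist(i, j) + min{D(i-1, j), D(i-1, j-1), D(i, j-1)}
            D[i, j] = dist + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[n, m]

# Example: compare two random MFCC-like sequences of different lengths
a = np.random.randn(40, 13)
b = np.random.randn(55, 13)
print(dtw_distance(a, b))
```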
ML is the mainstream algorithm used in the current intelligent speech recognition. It utilizes the knowledge of probability and statistics and a dataset to train a model containing the mapping relationship between input and output to realize the feature recognition of speech signals. Table 1 shows the commonly used ML algorithms in medical speech signal processing. The traditional ML algorithms include Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), etc. The DL algorithms include Deep Neural Network (DNN), Convolutional Neural Network (CNN), and the Long Short-Term Memory (LSTM) algorithm in the Recurrent Neural Network (RNN), etc. Some of the algorithms are briefly introduced as follows.
Table 1.
Algorithm | Characteristics | Ref. |
---|---|---|
GMM | The probability density function of observed data samples using a multivariate Gaussian mixture density. | [[27], [28], [29]] |
HMM | The Markov process is a double stochastic process in which there is an unobservable Markov chain defined by a state transition matrix. Each state of the chain is associated with a discrete or a continuous output probability distribution. | [[30], [31], [32], [33]] |
SVM | Support vector machine (SVM) is a binary classifier with advantages in few-shot classification, such as pathological voice detection. | [[34], [35], [36], [37]] |
DNN | Consists of fully connected layers and is popular in learning a hierarchy of invariant and discriminative features. Features learned by DNNs are more generalized than the traditional hand-crafted features. | [[38], [39], [40]] |
CNN | A convolutional layer is the main building block of CNNs. Designed for image recognition but also extended for speech technology. Using the spectrogram of speech signals to classify them. | [[41], [42], [43], [44]] |
LSTM | A type of recurrent neural network (RNN) architecture and well-suited to learn from experience to classify, process, and predict time series when there are very long-time lags of unknown size between important events. | [[45], [46], [47]] |
A Gaussian model over a one-dimensional variable is the Gaussian distribution
(7) $p(x)=\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}\right)$
As shown in Fig. 6 , GMM refers to the superposition of multiple Gaussian models, and its variables are multi-dimensional vectors [53]. Then, the mixed Gaussian distribution p(x) is generally represented by the mean and covariance matrix of the variables
(8) $p(\mathbf{x})=\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}\!\left(\mathbf{x}\mid\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k}\right),\quad \sum_{k=1}^{K}\pi_{k}=1$
where the multidimensional variable is $\mathbf{x}=(x_1, x_2, x_3, \ldots, x_D)$, the mean is $\boldsymbol{\mu} = E(\mathbf{x})$, the covariance matrix is $\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{T}]$, and $\pi_k$ is the weight of the k-th Gaussian component. This model is usually trained with the Expectation-Maximization algorithm to maximize the likelihood on the training set.
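As a minimal sketch (not the setup of any particular cited study), two GMMs can be trained with scikit-learn on frame-level features, one per class, and a test utterance is assigned to the class whose GMM gives the higher likelihood; the features and class names here are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder frame-level features (e.g., MFCCs): rows = frames, columns = dimensions
healthy_frames = rng.normal(0.0, 1.0, size=(500, 13))
patho_frames = rng.normal(0.5, 1.2, size=(500, 13))

# One GMM per class, trained with the EM algorithm
gmm_healthy = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(healthy_frames)
gmm_patho = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(patho_frames)

# Classify a test utterance by comparing average frame log-likelihoods
test_frames = rng.normal(0.5, 1.2, size=(120, 13))
scores = {'healthy': gmm_healthy.score(test_frames), 'pathological': gmm_patho.score(test_frames)}
print(max(scores, key=scores.get))
```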
Markov chains represent the transition relationships between states. As shown in Fig. 7, an HMM adds a mapping between states and observations on top of a Markov chain [54]. $a_{ij}$ is the probability of transitioning from the current state $S_i$ to the next state $S_j$:
(9) $a_{ij}=P\!\left(q_{t+1}=S_{j}\mid q_{t}=S_{i}\right)$
$b_t$ is the probability that the current state emits the observed value $o_t$:
(10) $b_{t}=P\!\left(o_{t}\mid q_{t}\right)$
We can use this model to establish the mapping relationship between the observation sequence and the underlying state sequence. With the input speech features as observations, the hidden state sequence with the highest probability is then decoded (e.g., with the Viterbi algorithm) as the model's output.
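The decoding step described above can be illustrated with a compact Viterbi sketch over a toy HMM with discrete observations; the transition matrix, emission matrix, and observation sequence below are arbitrary placeholders, not values from any cited work.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most probable hidden state sequence for a discrete-observation HMM.
    pi: initial probabilities (N,), A: transitions (N, N), B: emissions (N, M)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            trans = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(trans)
            delta[t, j] = trans[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.insert(0, int(psi[t, path[0]]))
    return path

# Toy example: 2 hidden states, 3 possible observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))
```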
As illustrated in Fig. 8, the basic idea of SVM is to find an optimal hyperplane in a high-dimensional space that separates the two classes of a binary classification problem. The hyperplane should ensure the minimum classification error rate [55]. The hyperplane in the high-dimensional space can be expressed as
(11) $\mathbf{W}^{T}\mathbf{x}+b=0$
The training of an SVM searches for suitable parameters W and b so that the hyperplane separates the different categories as well as possible.
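A minimal scikit-learn sketch of this idea (with placeholder features rather than a clinical dataset) trains an SVM on utterance-level feature vectors and reports held-out accuracy; the RBF kernel implicitly maps the features into a higher-dimensional space.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Placeholder utterance-level feature vectors (e.g., averaged MFCCs) and binary labels
X = np.vstack([rng.normal(0.0, 1.0, (100, 13)), rng.normal(0.8, 1.0, (100, 13))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel='rbf', C=1.0).fit(X_train, y_train)   # find the separating hyperplane
print(clf.score(X_test, y_test))                        # classification accuracy on held-out data
```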
In recent years, DL has considerably improved the performance of intelligent speech processing. As shown in Fig. 9 (a), the basic unit of a neural network is a neuron. In addition to an input layer and an output layer, a DNN has multiple hidden layers. Each layer contains numerous neurons, fully connected between adjacent layers to form a network [56]. The output vector v l of layer l is
(12) $\mathbf{v}^{l}=f\!\left(\mathbf{W}^{l}\mathbf{v}^{l-1}+\mathbf{b}^{l}\right)$
which is also the input vector of the next layer, where $f(\cdot)$ is the activation function, and $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ are the weight matrix and the bias vector of layer l, respectively; $\mathbf{v}^{l} \in \mathbb{R}^{N_l \times 1}$, $\mathbf{W}^{l} \in \mathbb{R}^{N_l \times N_{l-1}}$, $\mathbf{b}^{l} \in \mathbb{R}^{N_l \times 1}$, and $N_l$ is the number of neurons in layer l. Therefore, by adjusting $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ with the training data, we can establish connections among neurons in adjacent layers and finally obtain the mapping relationship between the input and output.
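Equation (12) corresponds directly to the following NumPy sketch of one forward pass through a small fully connected network with random (untrained) weights; the layer sizes and tanh activation are illustrative choices.

```python
import numpy as np

def forward(v0, weights, biases, activation=np.tanh):
    """Forward pass: v_l = f(W_l @ v_{l-1} + b_l) for each layer l."""
    v = v0
    for W, b in zip(weights, biases):
        v = activation(W @ v + b)
    return v

layer_sizes = [13, 64, 64, 2]   # input features -> two hidden layers -> 2 output units
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (layer_sizes[l + 1], layer_sizes[l])) for l in range(3)]
biases = [np.zeros(layer_sizes[l + 1]) for l in range(3)]
print(forward(rng.normal(size=13), weights, biases))
```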
As shown in Fig. 9(b), a CNN mainly consists of two components. One is the convolutional layer, composed of filters that compute local feature maps. $(h^{k})_{ij}$ denotes the value at position (i, j) of the k-th output feature map obtained from the input features
(13) $(h^{k})_{ij}=f\!\left(\left(\mathbf{W}^{k}\ast q\right)_{ij}+b_{k}\right)$
where q represents the input feature map, and $\mathbf{W}^{k}$ and $b_{k}$ represent the k-th filter and its bias, respectively, obtained from the training data. The other component of the CNN is the pooling layer, which reduces the dimensionality of each feature map and retains only the more critical features. Finally, the last layer of a CNN is usually a fully connected layer, which is utilized to implement regression or classification tasks [57].
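A minimal PyTorch sketch of the structure just described (one convolutional layer, one pooling layer, and a final fully connected layer operating on spectrogram-like inputs; the layer sizes are illustrative assumptions, not a model from the cited studies):

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # filters produce local feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling reduces dimensionality
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 32 * 50, n_classes),           # fully connected layer for classification
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One fake batch: 4 spectrograms of size 64 (frequency bins) x 100 (frames)
model = SpectrogramCNN()
logits = model(torch.randn(4, 1, 64, 100))
print(logits.shape)  # torch.Size([4, 2])
```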
As illustrated in Fig. 9(c), the characteristic of an RNN is that its processing of the current input is affected by previous inputs, which makes it better suited to processing time sequences [58]. The state transition and output of the hidden layer are:
(14) $s_{t}=f\!\left(\mathbf{U}x_{t}+\mathbf{W}s_{t-1}\right),\qquad o_{t}=g\!\left(\mathbf{V}s_{t}\right)$
where $x_t$ is the input at time t, $s_t$ and $s_{t-1}$ are the states of the hidden layer at time t and time t−1, respectively, $o_t$ is the output of the network, $\mathbf{W}$ is the weight matrix converting state t−1 to the input of state t, $\mathbf{U}$ and $\mathbf{V}$ are the weight matrices of the input and output, respectively, and $f(\cdot)$ and $g(\cdot)$ are activation functions.
A typical RNN suffers from the vanishing gradient problem; hence, researchers proposed LSTM networks to solve it [59]. In addition to these recognition algorithms, many researchers have proposed other algorithms, such as Generative Adversarial Networks and Variational Auto-Encoders, which are less related to this paper and will not be discussed here.
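For sequence inputs such as per-frame MFCC vectors, an LSTM can serve as a classifier; the following is a minimal PyTorch sketch with illustrative sizes (not a model from any cited study), which classifies an utterance from the final hidden state.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, n_features=13, hidden_size=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):                 # x: (batch, time, features)
        _, (h_n, _) = self.lstm(x)        # h_n: final hidden state of the sequence
        return self.fc(h_n[-1])           # classify from the last hidden state

model = LSTMClassifier()
logits = model(torch.randn(4, 120, 13))   # batch of 4 utterances, 120 frames each
print(logits.shape)                       # torch.Size([4, 2])
```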
The performance of speech recognition algorithms based on DL is far better than that of traditional ML algorithms. In particular, end-to-end algorithms based on attention and the Transformer have dramatically improved speech recognition performance in recent years. However, due to insufficient pathological speech data, traditional ML algorithms are still primarily used in pathological speech recognition.
3.2. Automatic speech recognition system architecture
As one of the representative ISTs, speech recognition plays a vital role in healthcare. As shown in Fig. 10, speech recognition has developed over a long time, from the initial DTW algorithm to the later GMM-HMM algorithm and then to algorithms combining DNNs with HMMs in recent years. All of them follow a similar processing pipeline. Pathological speech recognition is mainly based on the traditional front-end and back-end architecture. The conventional architecture of a speech recognition system is briefly introduced as follows.
As shown in Fig. 11 , a speech recognition system is generally divided into the front end and back end. The front end mainly completes speech signals’ acquisition, pre-processing, and feature extraction. The back end realizes the recognition of the obtained speech feature sequences and gets the final recognition result. Unlike traditional architectures, the latest end-to-end speech recognition algorithms can directly convert speech signals into text or classification results, significantly improving speech recognition performance. The applications of these novel algorithms in medical speech recognition are attracting much attention. The state-of-the-art methods will be introduced in the following sections.
3.3. Intelligent speech technology in medical scenarios
Healthcare services and treatment are indispensable in human society. The health system's capacity is essential to people's life and health. However, a hospital can only treat a limited number of patients daily due to its capacity, which is more severe in densely populated and undeveloped areas. We can utilize medical resources more efficiently if the non-medical workload of doctors is reduced and their work efficiency is improved [60]. As the aging population increases, patients' timely treatment, rehabilitation, and daily care are essential for their health. Many studies are trying to apply IST in different medical scenarios, such as speech-based assistants, telemedicine, and health monitoring [61], to change the working ways of medical staff and improve the efficiency of the medical system.
This paper reviews the applications of IST in smart hospitals, mainly from three aspects. (1) Use IST to recognize the doctors' voices and reduce their time spent in non-medical related work, which was studied by researchers from an early stage [62,63]. (2) IST is also utilized to process the patients’ speech signals to assist doctors in diagnosing and evaluating diseases [16]. This application has made significant breakthroughs in recent years with the development of ML and is also a hotspot of current research [64]. (3) IST is applied to medical equipment control to help doctors work efficiently [65,66]. These three aspects of applications are reviewed and summarized in the following three sections.
4. Speech recognition for electronic medical documentation
This section introduces the application of IST in electronic medical documentation, mainly including electronic medical record (EMR) transcription and electronic report generation. Then, we discuss some common issues of the existing medical transcription systems and typical solutions recently proposed. Finally, we present the critical indicators for evaluating the application effect of the transcription systems and their future directions.
Transcription refers to converting a speech signal into text using IST. The application of transcription in medical scenarios mainly refers to the generation of EMRs and reports; that is, doctors' speech during diagnosis and pathological examination, as well as the dialogue between doctors and patients, is converted into text records. Transcription can reduce doctors' burden of manual document editing, allowing them to focus on medical work and thereby improving their work efficiency [67].
In addition, transcription significantly affects many aspects, such as doctors' enthusiasm for diagnosis, hospital treatment costs, and the treatment process [68]. Transcription also has commercial value, and many companies have already developed products. For example, Nuance designed an integrated healthcare system that could generate clinical records based on doctor-patient conversations [69]. A user survey found that the system allowed physicians to devote more time to their patients and their own lives. Media Interface developed the Digital Patientendokumente product, which stored patient-related medical documents, nursing documents, and wills. This product allowed medical staff to review and sign patient documents quickly [70]. Unisound [71] and iFLYTEK [72] launched medical document entry systems, which effectively improved the work efficiency of medical staff. For instance, the entry system of iFLYTEK played an essential role in the fight against COVID-19.
4.1. Related studies and challenges
Using EMR transcription technology in medical scenarios has demonstrated apparent benefits. Table 2 shows that many researchers have used transcription technology to generate medical documents and investigated its application effects. They analyzed the accuracy, medical efficiency, and hospital cost of documentation by IST and proposed some problems and improvement methods.
Table 2.
Institute | Application scenario | Technical description | Application effect | Ref. |
---|---|---|---|---|
Zhejiang Provincial People's Hospital | Generate and extract pathological examination reports: 52h labeled pathological report recordings. | ASR system with Adaptive technology | Recognition rate = 77.87%; reduces labor costs; improves work efficiency and service quality | [81] |
Western Paraná State University | Audios collected from 30 volunteers | Google API and Microsoft API integrated with the web | Reduces the time to elaborate reports in the radiology | [89] |
University Hospital Mannheim | Lab test: 22 volunteers; Filed test: 2 male emergency physicians | IBM's Via-Voice Millennium Edition version 7.0 | The overall recognition rate is about 85%. About 75% in emergency medical missions | [77] |
Kerman University of Medical Sciences | Notes of hospitalized Patients from 2 groups of 35 nurses | Offline SR (Nevisa) Online SR (Speechtexter) | Users' technological literacy; Possibility of error report: handwritten < offline SR < online SR | [74] |
University of North Carolina School of Medicine | 6 radiologists dictated using speech-recognition software | PowerScribe 360 v4.0-SP2 reporting software | Near-significant increase in the rate of dictation errors; most errors are minor single incorrect words. | [79] |
King Saud University | CENSREC-1 database: 422 utterances spoken by 110 speakers | Interlaced derivative pattern | 99.78% and 97.30% accuracies using speeches recorded by microphone and smartphone | [18] |
KPR Institute of Engineering and Technology | 6660 medical speech transcription audio files and 1440 audio files from the RAVDESS dataset | Hybrid Speech Enhancement Algorithm | Minimum word error rates of 9.5% for medical speech and 7.6% for RAVDESS speech | [80] |
Simon Fraser University | Co-occurrence statistics for 2700 anonymized magnetic resonance imaging reports | Dragon Naturally Speaking speech-recognition system; Bayes' theorem | Error detection rate as high as 96% in some cases | [83]
Graz University of Technology | 239 clinical reports | Semantic and phonetic automatic reconstruction | Relative word error rate reduction of 7.74% | [25] |
Zhejiang University | Radiology Information System Records | Synthetic method | About 3% superior to the traditional MAP + MLLR | [49] |
Brigham and Women's Hospital | Records of 10 physicians who had used SR for at least 6 months | Morae usability software | Dictated notes have higher mean quality considering uncorrected errors and document time. | [75] |
Previous work has shown the effects and problems of transcription technology in medical document generation. For example, Ajami et al. investigated the previous medical transcription studies according to the usage scenario. Their results showed that the document generation performance was poor when the same vocabulary was used for different purposes. In addition, they found that although the use of speech recognition in the radiology report generation saved much time, the strict error checking in the later stage caused an increase in the overall turnaround time due to the high accuracy requirements of the report [73]. Peivandi et al. [74] and Poder et al. [13] also made a similar point that speech recognition accuracy was not as good as the accuracy of manual transcription. Although speech recognition has dramatically shortened the turnaround time of reports, doctors need to spend more time on dictation and correction due to the higher error rate of transcription [13].
Moreover, the advantages of electronic report generation are offset by the doctor's burden of verification and the risk of extra errors in the report. At the same time, previous studies have found considerable differences in the efficiency improvement of using transcription technology in different departments. By studying the previous work, Blackley et al. obtained some valuable and novel insights. For example, they found significant differences in the types and frequencies of words used when dictating and typing documents [75]. These differences may affect the quality of the documentation. They also found a lack of a unified and effective method for evaluating the impact of IST in medical scenarios [17].
The effects of transcription technology in medical scenarios include positive and negative aspects. The main advantages include reducing the turnaround time of most texts and quickly uploading the texts to the patient's electronic health record. Transcription also ensures the correctness of electronic documents in some scenarios that require multiple transcriptions and copies. In addition, transcription frees the doctors' eyes and hands, improves work efficiency in some scenarios, and brings them positive emotions [76]. Furthermore, in emergency medical missions, transcription technology can better meet the requirements for accurate time recordings of resuscitation than traditional methods [77]. Moreover, medical documents produced by transcription systems are more concise, standardized, and maintainable.
On the negative side, there are potential recognition errors in the documents, so the turnaround time is not shortened as expected in scenarios with high accuracy requirements [78]. In addition, delays in speech signal processing make doctors and patients lose patience with IST. Moreover, background noise in public areas of hospitals, non-standard pronunciation, interruptions during speaking, and the wearing of surgical masks [79] lead to decreased recognition accuracy, affecting the mood of doctors and patients and their acceptance of IST [73].
4.2. Solutions for performance improvements
The critical issue for medical transcription technology is continuous speech recognition. Current continuous speech recognition technology achieves high accuracy in most scenarios. However, improvements can be made at different processing stages of speech recognition to ensure accuracy and overcome the problems of IST in medical scenarios. Several studies have proposed improvement schemes.
There are some methods to improve the adaptability of transcription systems. For the background noise problem, the microphone array combined with noise reduction algorithms can reduce the impact of the noise [19]. As shown in Fig. 12 (a), Gnanamanickam et al. proposed a cascaded speech enhancement algorithm using HMM to optimize the algorithm of nonlinear spectral subtraction, which improved the effect of medical speech recognition [80]. For different department scenarios, Duan et al. added the noise of the corresponding department when training the acoustic model [81]. They combined the knowledge transfer technique to improve the adaptability of the acoustic model and its recognition performance in specific application scenarios.
Regarding acoustic models, Muhammad et al. proposed a feature extraction technique less affected by noise, the interlaced derivative pattern, which achieved higher accuracy and shorter recognition time in a cloud computing-based speech medical framework [18]. In terms of language models, according to the different types of generated medical documents and the various probabilities of lexical occurrences, training the corresponding language models in a targeted manner is a method to improve recognition accuracy. As shown in Fig. 12(b), to make the model more adaptable in different departments, Wu et al. introduced a simplified Maximum Likelihood Linear Regression (MLLR) into the incremental Maximum A Posteriori (MAP) process to enable the parameters to be continuously adjusted according to the speech and text [49]. Speech transcription technology has also been applied to some products. For example, Unisound developed a pathology entry system for the radiology department [71]. The system can free the doctors' hands and allow them to enter the examination report while observing the image of the lesion. iFLYTEK also designed a medical document generation system for the dental department [82]. By wearing a small microphone, dentists can record information about the patient's condition during oral diagnosis.
Researchers also proposed methods to improve the quality of reports generated by transcription systems. For example, correction reports in electronic documents usually cause the problem of massive waste of resources [62]. Voll et al. proposed a text error correction scheme in post-processing for different medical documents to address this problem [83]. After the radiology report was generated, the frequency of different words appearing in the context was used to correct the report and mark the keywords, which was convenient for manual proofreading to shorten the document generation time [83]. In addition, Klann et al. proposed that using the Key-Val method to structure the report could reduce errors and improve its quality [84].
Sharing and security of electronic medical documents are also important issues. As shown in Fig. 13 (a), Muhammad et al. proposed an Internet-based cloud service architecture, which can realize unified management of electronic medical documents and facilitate communication between doctors and patients whenever and wherever possible. However, some scenarios have time delays and data security problems [18]. As shown in Fig. 13(b), Qin et al. proposed a hospital intelligence framework based on cloud computing and fog computing to alleviate the delay problem. The service nodes are deployed in the hospital, which can improve the quality of the voice transcription service [85] and ensure the security of the data. Singh et al. also presented an architecture similar to the one shown in Fig. 13(b). In addition, they proposed adding an IoT layer as a data source [86] so that guardians could obtain a real-time alert on students’ overall emotions in response to their stressful situations.
4.3. Summary and discussion
Accuracy is a significant indicator for electronic document and report generation systems used in medical scenarios [87]. We should also pay attention to the efficiency improvement of hospitals after deploying these systems [88]. Therefore, the four key evaluation indicators shown in Table 3 can be referred to when evaluating these systems. The four indicators reflect the primary concerns of doctors and patients in actual medical scenarios.
a) Report average turnaround time can measure the improvement of medical efficiency. Reducing this time is the primary purpose of applying transcription technology in medical scenarios.
b) The average number of critical errors in the generated medical documents can measure the reliability of the transcription system. Healthcare is related to the patient's health, so an error-prone transcription system is unacceptable.
c) A high average word error rate in the generated documents increases the time medical staff spend correcting errors and affects the patient experience (a minimal computation sketch is given after Table 3). We can quantitatively evaluate the above three indicators through the generated medical documents.
d) Questionnaires and other methods need to be adopted to assess the user experience of the medical staff and patients in different departments and scenarios, serving as a benchmark for improving the transcription system.
Table 3.
Indicators | Definition | Meaning | Ref. |
---|---|---|---|
Report average turnaround time | Average time from the start of report generation to patient accessibility | Turnaround time reduction reflects medical efficiency improvement brought by the transcription system. | [13] |
Average number of critical errors | Number of medically misleading errors in generated documents | Reflects the reliability of the transcription system. | [87] |
Average word error rate | Number of typos in generated documents | Reflects the effect of the document and influences the satisfaction of doctors. | [13] [87] |
User experience of doctors and patients | Satisfaction of doctors and patients with all aspects of the generated documents | Improving work efficiency and user experience and reducing medical burden are goals of transcription systems. | [90] [91] |
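The average word error rate indicator in Table 3 can be computed with a standard edit-distance formulation, as in this minimal Python sketch (the reference and hypothesis sentences are made-up examples, not transcripts from the cited studies):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance between the two word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("patient reports chest pain since morning",
                      "patient reports chess pain this morning"))  # 2 errors / 6 words
```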
The interaction between the system and doctors should be considered a priority in the future development of medical transcription systems. Firstly, a more reasonable transcription process can be designed for each department so that medical staff can use transcription tools efficiently after training. Secondly, new speech recognition solutions from other fields need to be applied to medical scenarios to enhance the reliability of electronic medical documentation systems. Thirdly, improvements can also start from the post-processing stage, strengthening the system's error correction capability and its adaptability to different document types to provide convenience for doctors [25].
5. Pathological voice recognition for diagnosis and evaluation
This section introduces the application of IST in disease diagnosis (disease unknown) and evaluation (disease known) using pathological voice. Then, we discuss the data types, features, and recognition algorithms of pathological voices from a technical perspective. Finally, we present IST's future directions and trends in medical diagnosis. Since diseases can affect patients' normal speech, cause them to cough and sneeze, and even make their breathing sounds abnormal, this section covers both speech signals and other voice types for disease diagnosis and evaluation.
5.1. Related studies and voice signal types
People express their feelings and thoughts by speaking. Speaking is accomplished through coordinated movements of the head, neck, and abdomen muscles. Individuals who cannot correctly coordinate these muscles will produce pathological speech [156]. Pathological speech-based disease diagnosis uses speech signal processing technologies to judge whether the patient suffers from certain diseases or to evaluate the patient's condition.
As shown in Table 4 , many studies use speech technology to diagnose diseases that cause voice problems [157]. The diseases include Voice disorder [99], Acute decompensated heart failure [100], Alzheimer's Disease (AD) [104], Dysphonia [118], Parkinson's Disease (PD) [122,[125], [126], [127], [128]], Stroke [125,224], COVID-19 [130,132,135], Chronic Obstructive Pulmonary Disease [142,143], Aphasia [169,170,181], Tuberculosis (TB) [[147], [148], [149]], and organ lesions such as oral cancer [158], head and neck cancer [159], nodules, polyps, and Reinke's edema [95]. These studies are divided into four categories by diseases, including the otorhinolaryngology department, respiratory department, neurology department, and others, and are shown in the four sub-tables, respectively.
Table 4.
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref. |
---|---|---|---|---|---|---|
Vocal Fold disorders | 41 HP, 111 Ps | SV/a/ | Jitter, RAP, Shimmer, APQ, MFCC Harmonic to Noise Ratio (HNR), SPI | ANN, GMM, HMM, SVM | Average classification rate in GMM reaches 95.2% | [92] |
KAY database: 53 HP, 94 Ps | SV/a/ | Wavelet-packet coefficients, energy, and entropy, selected by algorithms | SVM, KNN | Best accuracy = 91% | [93] | |
MEEI: 53 HP, 657 Ps | SV/a/ | Features based on the phenomena of critical bandwidths | GMM | Best accuracy = 99.72% | [94] | |
Benign Vocal Fold Lesions | MEEI: 53 HP, 63 Ps; SVD: 869 HP, 108 Ps; Hospital Universitario Príncipe de Asturias (HUPA): 239 HP, 85 Ps; UEX-Voice: 30 HP, 84 Ps | SV/a/and SS | MFCC, HNR, Energy, Normalized Noise Energy | Random-Forest (RF) and Multi-condition Training | Accuracies: about 95% in MEEI, 78% in HUPA, and 74% in SVD | [95] |
Voice disorder | MEEI: 53 HP, 372 Ps; SVD: 685 HP, 685 Ps; VOICED: 58 HP, 150 Ps | SV/a/ | Fundamental Frequency (F0), jitter, shimmer, HNR | Boosted Trees (BT), KNN, SVM, Decision Tree (DT), Naive Bayes (NB) | Best performance achieved by BT (AUC = 0.91) | [96]
KAY: 213 Ps | SV/a/ | Features are extracted through an adaptive wavelet filterbank | SVM | Sort six types of disorders successfully | [97] | |
KAY: 57 HP, 653 Ps samples from Persian native speakers: 10 HP, 19 Ps | SV/a/ | Same as above | SVM | Accuracy = 100% on both databases | [98] | |
30 HP, 30 Ps | SV/a/ | Daubechies' DWT, LPC | Least squares SVM | Accuracy >90% | [97] | |
MEEI: 53 HP, 173 Ps | SV/a/and SS | Linear Prediction Coefficients | GMM | Accuracy = 99.94% (voice disorder), Accuracy = 99.75% (running speech) | [101] | |
Dysphonia | Corpus Gesproken Nederlands corpus; EST speech database: 16 Ps; CHASING01 speech database: 5 Ps; Flemish COPAS pathological speech corpus: 122 HP, 197 Ps | SV/a/and SS. | Gammatone filterbank features and bottleneck feature | Time-frequency CNN | Accuracy ≈89% | [144] |
TORGO Dataset: 8 HP, 7 Ps | SS | Mel-spectrogram | Transfer learning based CNN model | Accuracy = 97.73%, | [145] | |
UA-Speech: 13 HP, 15 Ps | SS | Time- and frequency-domain glottal features and PCA-based glottal features | Multiclass-SVM | Best accuracy ≈ 69% | [146] | |
Pathological Voice | SVD: approximately 400 native Germans | SV/a/ | Co-Occurrence Matrix | GMM | Accuracy reaches 99% only by voice | [102] |
MEEI: 53 HP; SVD: 1500 Ps | SV/a/ | Local binary pattern, MFCC | GMM, extreme learning machine | Best accuracy = 98.1% | [103] |
SVD | SV/a/,/i/,/u/ | Multi-center and multi-threshold based ternary patterns and Features selected by Neighborhood Component Analysis | NB, KNN, DT, SVM, bagged tree, linear discriminant | Accuracy = 100% | [108] | |
SVD: samples of speakers aged 15–60 years | SV/a/ | Feature extracted from spectrograms by CNN | CNN, LSTM | Accuracy reaches 95.65% | [109] | |
Cyst Polyp Paralysis | SVD: 262 HP, 244 Ps; MEEI: 53 HP, 95 Ps | SV/a/ | spectrogram | CNN (VGG16 Net and Caffe-Net), SVM | Accuracy = 98.77% on SVD | [105]
SVD: 686 HP, 1342 Ps | SV/a/,/i/,/u/and SS | Spectro-temporal representation of the signal | Parallel CNN | Accuracy = 95.5% | [106] | |
Acute decompensated heart failure | 1484 recordings from 40 patients | SS | time, frequency resolution, and linear versus perceptual (ear) mode | Similarity calculation and Cluster algorithm | 94% of cases are tagged as different from the baseline | [100] |
Common vocal diseases | FEMH data: 588 HP; Phonotrauma data: 366 HP | SV/a/ | MFCC and medical record features | GMM and DNN, two stages DNN | Best accuracy = 87.26% | [107]
Application of speech technology in pathological voice recognition and evaluation (neurology department)
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
---|---|---|---|---|---|---
Parkinson's Disease (PD) | UCI Machine Learning repository: 8 HP, 23 Ps | SV | Features selected by the Relief algorithm | SVM and bacterial foraging algorithm | Best accuracy = 97.42% | [119] |
98 S | SV/a/, SS | OpenSMILE features, MPEG-7 features, etc. | RF | Best accuracy ≈80% | [120] | |
UCI Machine Learning repository; Training: 20 HP, 20 Ps; Testing: 28 S | SV and SS | Wavelet Packet Transforms, MFCC, and the fusion | HMM, SVM | Best accuracy = 95.16%, | [121] | |
Group 1: 28 PD Ps; Group 2: 40 PD Ps | SS | Diadochokinetic sequences with repeated [pa], [ta], and [ka] syllables | Ordinal regression models | The [ka] model achieves agreements with human raters' perception | [122] |
Istanbul acoustic dataset (IAD) [123]: 74 HP, 188 Ps; Spanish acoustic dataset (SAD) [124]: 80 HP, 40 Ps | SV/a/ | MFCC, Wavelet and Tunable Q-Factor wavelet transform, Jitter, Shimmer, etc. | Three DTs. | Best accuracy = 94.12% on IAD and = 95% on SAD | [125] |
Training: 392 HP, 106 Ps; Testing: 80 HP, 40 Ps | SS | MFCC, Bark-band Energies (BBE) and F0, etc. | RF, SVM, LR, Multiple Instance Learning | The best model yielded 0.69/0.68/0.63/0.8 AUC for four languages | [126] |
Istanbul acoustic dataset: 74 HP, 188 Ps | SV/a/ | MFCC, Deep Auto Encoder (DAE), SVM | LR, SVM, KNN, RF, GB, Stochastic Gradient Descent | Accuracy = 95.49% | [127] | |
PC-GITA: 50 HP, 50 Ps; SVD: 687 HP, 1355 Ps; Vowels dataset: 1676 S | SV | Spectrogram | CNN | Best accuracy = 99% | [128] |
Alzheimer's disease (AD) | 50 HP, 20 Ps | SS | Fractal dimension and some features selected by algorithms | MLP, KNN | Best accuracy = 92.43% on AD | [104] |
PD, Huntington's disease (HD), or dementia | 8 HP, 7 Ps | SS | Pitch, Gammatone cepstral coefficients, MFCC, wavelet scattering transform | Bi-LSTM | Accuracy = 94.29% | [110] |
Dementia | Two corpora recorded at the Hospital's memory clinic in Sheffield, UK; corpora 1: 30 Ps corpora 2: 12 Ps, 24 S | SS | 44 features (20 conversation analysis based, 12 acoustic, and 12 lexical) | SVM | Accuracy = 90.9% | [111] |
DementiaBank Pitt Corpus [112]: 98 HP, 169 Ps PROMPT Database [113]: 72 HP, 91 Ps | SS | Combined Low-Level Descriptors (LLD) features extracted by openSMILE [114] | Gated CNN | Accuracy = 73.1% on Pitt Corpus and = 74.1 on PROMPT | [115] | |
Dysarthria | UA-Speech: 12 HP, 15 CP Ps; MoSpeeDi: 20 HP, 20 Ps; PC-GITA database [116]: 45 HP, 45 PD Ps | SS | Spectro-temporal subspace, MFCC, the frequency-dependent shape parameter | Grassmann Discriminant Analysis | Best accuracy = 96.3% on UA-Speech | [117]
65 HP, 65 MS-positive Ps | SS | Seven features including Speech duration, vowel-to-recording ratio, etc. | SVM, RF, KNN, MLP, etc. | Accuracy = 82% | [118] | |
Distinguishing two kinds of dysarthria | 174 HP, 76 Ps | SV and SS | Cepstral peak prominence | classification and regression tree; RF; Gradient Boosting Machine (GBM); XGBoost | Accuracy = 83% | [155] |
Application of speech technology in pathological voice recognition and evaluation (respiratory department)
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
---|---|---|---|---|---|---
COVID-19 | 130 HP, 69 Ps | SV/a/and cough | feature sets extracted with the openSMILE, open-source software, and Deep CNN, respectively | SVM and RF | Accuracy ≈80% | [129] |
Sonda Health COVID-19 2020 (SHC) dataset [130]: 44 HP, 22 Ps | SV and SS | Features (glottal, spectral, prosodic) extracted by COVAREP speech toolkit | DT | Feature-task combinations accuracy >80% | [131] | |
Coswara: 490 HP, 54 Ps | SV/a/,/i/,/o/; | Fundamental, MFCC Frequency (F0), jitter, shimmer, HNR | SVM | Accuracy ≈ 97% | [132] | |
DiCOVA Challenge dataset and COUGHVID: Training: 772 HP, 50 Ps; Validation: 193 HP, 25 Ps; Testing: 233 S | Cough | MFCC, Teager Energy Cepstral Coefficients (TECC) | Light GBM | The best result is 76.31% | [133] |
MSC-COVID-19 database: 260 S | SS | Mel spectrogram | SVM & Resnet | Assess patient status by sound is effective | [134] | |
Integrated Portable Medical Assistant collected: 36 S | Cough and speech | Mel spectrogram, Local Ternary Pattern | SVM | Accuracy = 100% | [135] | |
COUGHVID: more than 20,000 S; Cambridge Dataset [136]: 660 HP, 204 Ps; Coswara: 1785 HP, 346 Ps | Cough | MFCC, spectral features, chroma features | Resnet and DNN | Sensitivity = 93%, specificity = 94% | [137] |
COUGHVID: 1010 Ps; Coswara: 400 Ps; Covid19-Cough: 682 Ps | Cough, breathing cycles, and SS | Mel-spectrograms and cochlea-grams, etc. | DCNN, Light GBM | AUC reaches 0.8 | [138] | |
Cambridge dataset: 330 HP, 195 Ps; Coswara: 1134 HP, 185 Ps; Virufy: 73 HP, 48 Ps; NoCoCODa: 73 Ps | Cough | audio features, including MFCC, Mel-Scaled Spectrogram, etc. | Extremely Randomized Trees, SVM, RF, MLP, KNN, etc. | AUC reaches 0.95 | [139] | |
Coswara: 1079 HP, 92 Ps; Sarcos: 26 HP, 18 Ps | Cough | MFCC | LR, KNN, SVM, MLP, CNN, LSTM, Restnet50 | AUC reaches 0.98 | [140] |
Coswara, ComParE dataset, Sarcos dataset | Cough, breathing, sneeze, speech | Bottleneck feature | LR, SVM, KNN, MLP | AUC reaches 0.98 | [141] | |
Chronic Obstructive Pulmonary Disease | 25 HP, 30 Ps | respiratory sound signals | MFCC, LPC, etc. | SVM, KNN, LR, DT, etc. | Accuracies of SVM and LR are 100% | [142] |
429 respiratory sound samples | respiratory sound signals | MFCC; Hilbert-Huang Transform (HHT)-MFCC; HHT-MFCC-Energy | SVM | Accuracy = 97.8% by HHT-MFCC-Energy | [143] | |
Tuberculosis (TB) | 21 HP, 17 Ps, cough recordings: 748 | Cough | MFCC, Log spectral energy | LR | AUC reaches 0.95 | [148] |
35 HP, 16 Ps, cough recordings:1358 | Cough | MFCC, Log-filterbank energies, zero-crossing-rate, Kurtosis | LR, KNN, SVM, MLP, CNN | LR outperforms the other four classifiers, achieving an AUC of 0.86 | [147] | |
TASK, Sarcos, Brooklyn datasets: 21 HP, 17 Ps; Wallacedene dataset: 16 Ps; Coswara: 1079 HP, 92 Ps; ComParE: 398 HP, 199 Ps | Cough | MFCC | CNN, LSTM, Resnet50 | Resnet50 AUC: 91.90%; CNN AUC: 88.95%; LSTM AUC: 88.84% | [149]
Application of speech technology in pathological voice recognition and evaluation (others)
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
---|---|---|---|---|---|---
Juvenile Idiopathic Arthritis | 5 HP, 3 Ps | Knee Acoustical | Spectral, MFCC, or band power feature | Gradient Boosted Trees, neural network | Accuracy = 92.3% using GBT, Accuracy = 72.9% using neural network | [150] |
Stress | 6 categories of emotions, namely: Surprise, Fear, Neutral, Anger, Sad, and Happy | SS (facial expressions, content of speech) | Mel scaled spectrogram | Multinomial Naïve Bayes, Bi-LSTM, CNN | Assess students' stress by facial expressions and speech is effective | [86] |
Depression and Other Psychiatric Conditions | Group 1: depression (DP) 27 S; Group 2: other psychiatric conditions (OP) 12 S; Group 3: normal controls (NC) 27 S | SS | Features extracted by openSMILE and Weka program [151] | Five multiclass classifier schemes of scikit-learn | Accuracy = 83.33%, sensitivity = 83.33%, and specificity = 91.67% | [152]
Depression | AVEC 2014 dataset: 84 S; TIMIT dataset | SS | TEO-CB-Auto-Env, Cepstral, Prosodic, Spectral, and Glottal, MFCC | Cosine similarity | Accuracy = 90% | [154] |
SV=Sustained vowel, SS=Spontaneous speech, Ps = Patients, HP=Healthy People, S=Subjects.
Most of the speech data used in these studies come from existing or small private datasets collected from medical institutions. For example, the frequently adopted pathological speech datasets include Parkinson's Telemonitoring Dataset [160], Saarbrucken Voice database (SVD) [161], Massachusetts Eye & Ear Infirmary (MEEI), TORGO [162], VOICED [163], University of California Irvine (UCI) Machine Learning repository [164], Universal Access Speech Database (UA-Speech) [165], Coswara database [166], the COUGHVID corpus [167], and Computational Paralinguistics ChallengE (ComParE) [190]. The above datasets contain pathological voices of many diseases, which provide convenience for IST-based diagnosis system research.
We can see from Table 4 that the accuracies of most studies are over 90% (sensitivity and specificity are not shown), which demonstrates the feasibility of diagnosis through speech signals. Meanwhile, even when the same dataset is used to diagnose the same disease, there are significant differences between studies. The main reasons include differences between identification methods and the implicit problem that different studies screen the data of the same dataset differently. Therefore, directly comparing the recognition performance of different methods is not very meaningful. Nevertheless, the trends and proposed methods can inspire further research.
Each of the surveyed studies applies one or more recognition algorithms to process pathological voices. As shown by the statistical analysis in Fig. 14, about 81% of the articles use ML methods and achieve satisfactory accuracy in disease diagnosis. In recent years, the proportion of DL methods has increased (42%), but traditional ML methods are still the primary ones. A crucial reason is that the available data can hardly meet the needs of state-of-the-art end-to-end recognition methods. Therefore, some studies have tried solutions such as data augmentation [139,140,149,171] and transfer learning [141,145,149] to address this problem. The details of the diagnosis systems in these studies, including data sources, voice type, voice feature, classifier, and effect, can be found in Table 4.
For feature extraction, in addition to the common features in the time domain and frequency domain, some studies also try rare features [94,103,104,117,121,127,133,137,144] or use existing feature sets for research, such as OpenSMILE features [114,115,120,129,152], features extracted by the Weka program [152], and COVAREP speech toolkit [131], MPEG-7 features [120]. Moreover, some studies extract features by DL algorithms, dimension reduction algorithms [108,111,119,146], or heuristic algorithms [93,97,98].
The main types of voice data are sustained vowels (SV), spontaneous speech (SS) sentences, coughs, and breathing sounds. SV-based methods generally collect patients' sustained vowel articulations [92,96,101,106]. SS-based methods use sentence-level features and collect patients' spontaneous speech or continuous reading of a given text as experimental data [110]. Because the voice types differ, the corresponding studies also differ markedly in feature extraction and recognition methods. In addition, some studies directly use general speech transcription systems to evaluate patients' conditions. Fig. 15 shows the statistical analysis results. 38% of the articles adopted SV as the speech signal, which also obtained the highest average accuracy. 18% of the articles used both SV and SS; although more data types are utilized, there is no significant performance improvement. This indicates that how to extract information from different voice types and combine them effectively is still an open issue. Other voice types account for 28%, because coughing, breathing, and sneezing are the main diagnostic signals for respiratory-related diseases.
5.2. Conventional methods
The research using SV data generally starts from the quality and frequency-domain characteristic parameters of pathological voice signals. Wang et al. combined MFCCs with six speech quality features (jitter, shimmer, harmonic-to-noise ratio (HNR), soft phonation index (SPI), amplitude perturbation quotient (APQ), and relative average perturbation (RAP)) of the sustained vowel /a/ to recognize pathological voice. They used HMM, GMM, SVM, and Artificial Neural Networks (ANN) in two-class comparison experiments and found that the GMM method gave the best classification accuracy, with an accuracy rate of 95.2% [92]. Similar work was done by Verde et al. [96]; the difference is that Verde et al. extracted features that included the fundamental frequency F0 of the speech signal. In addition, they used a boosted tree algorithm as the classifier in an experimental study on data selected from three different databases. Ali et al. also adopted the patient's sustained vowel /a/ as the research object. They proposed features based on the phenomena of critical bandwidths and combined them with an HMM to detect vocal cord disorders, with an accuracy rate of more than 95% [94]. Baird et al. extracted features such as pitch, intensity, and HNR from the SVs in the Dusseldorf Anxiety Corpus to assess patients' anxiety [172]. Their results verified the effectiveness of speech-based features for predicting anxiety and showed better recognition performance for higher anxiety levels.
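A minimal sketch of the sustained-vowel pipeline described above is given below, assuming MFCC frame features and two class-wise GMMs scored by log-likelihood; it omits the jitter, shimmer, and HNR measures the cited studies compute (typically with a tool such as Praat), and the file lists and component counts are hypothetical.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs of a sustained /a/ recording; jitter, shimmer, and
    HNR would normally be added from a tool such as Praat (omitted here)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

# Hypothetical training file lists for the two classes.
patho_files = ["patho_001.wav", "patho_002.wav"]
healthy_files = ["healthy_001.wav", "healthy_002.wav"]

gmm_patho = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm_healthy = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm_patho.fit(np.vstack([mfcc_frames(f) for f in patho_files]))
gmm_healthy.fit(np.vstack([mfcc_frames(f) for f in healthy_files]))

def classify(path):
    """Pick the class whose GMM gives the higher average frame log-likelihood."""
    x = mfcc_frames(path)
    return "pathological" if gmm_patho.score(x) > gmm_healthy.score(x) else "healthy"
```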
For research based on sentence-level speech data, in addition to quality and time-frequency characteristics, the prosodic characteristics of sentences are also an effective entry point. Kim et al. adopted phoneme, prosody, and speech-quality parameters as features and predicted the intelligibility of aphasic speech in the Korean Quality-of-Life Technology database using Support Vector Regression (SVR) [173]. They also proposed a structured sparse linear model incorporating phonological knowledge to predict the speech intelligibility of patients with dysarthria [174]. Martínez et al. assessed dysarthria intelligibility using i-vectors extracted by factor analysis from the supervector of a universal GMM [175]. After evaluation by SVR and linear prediction, the speeches in the Wall Street Journal 1 and UA-Speech databases were divided into four levels: very low, low, mid, and high. Kadi et al. used a set of prosodic features selected by linear discriminant analysis combined with SVM and GMM, respectively, to classify dysarthric speech in the Nemours database into four severity levels and achieved a best classification rate of 93% [176]. Kim et al. classified pathological voice using sentence-level features of abnormal changes in prosody, phonological quality, and pronunciation. The pathological speech of the NKI CCRT Speech Corpus and the TORGO database was classified into two categories (intelligible and incomprehensible), and posterior smoothing was performed after classification [177]. These studies all exploit prosodic characteristics. However, different languages have different prosodic patterns, which means models trained in this way generalize less well than models trained on SVs.
There are many other studies based on speech recognition technology [178]. As shown in Fig. 16, Liu et al. used speech recognition to extract features and then integrated traditional acoustic feature classification to assess the severity of the voice disorder [168]. Bhat et al. utilized a bidirectional LSTM network for binary classification of the speech intelligibility of dysarthric speech in the TORGO dataset [179] and compared the classification performance when using MFCC, log filter bank, and i-vector features. In addition, Dimauro et al. adopted Google's speech recognition system to convert patients' speech into text [180]. Their results showed that the recognition error rate of the PD group was almost always higher than that of the normal group.
5.3. State-of-the-art methods
In addition to the traditional identification methods, some new methods have been designed in recent years. As shown in Fig. 17, Barbera et al. obtained the posterior probability of the patient's speech from an acoustic model trained with a DNN, compared it with the posterior probability of normal speech, and used the DTW algorithm to calculate the distance for classification [169,170]. The combination of DNN and vector-matching methods achieved good results on speech from word-naming tests, which suggests that traditional methods can be usefully integrated with recent ones. For example, Lee et al. analyzed the distribution of frame-level posteriors produced by DNN-HMM acoustic models [182]. They proposed an effective method for continuous speech utterances that extracts dysphonia features from a specific set of discriminative phones with an ASR system.
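The posterior-comparison idea can be sketched as follows: a plain DTW distance between a patient's frame-level posterior sequence and a healthy reference sequence, with a threshold deciding the class. The posterior matrices, their shapes, and the threshold are illustrative; this is not Barbera et al.'s exact implementation.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between two sequences of frame-level posterior
    vectors a (T1, C) and b (T2, C), with Euclidean frame cost."""
    t1, t2 = len(a), len(b)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[t1, t2] / (t1 + t2)  # length-normalised alignment cost

# Illustrative inputs: posteriors of a patient utterance and of a healthy
# reference for the same word, e.g. produced by a DNN acoustic model.
patient_post = np.random.rand(120, 40)
reference_post = np.random.rand(110, 40)
score = dtw_distance(patient_post, reference_post)
print("impaired" if score > 1.5 else "typical")  # threshold is illustrative only
```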
Many studies also transform the features of one-dimensional speech signals into two-dimensional representations and use algorithms from the field of image recognition for disease diagnosis. For example, Alhussein et al. converted pathological speech signals into spectrograms and then adopted CNNs for classification [105,106]. Qin et al. conducted a similar study, except that the input was a posterior probability map [181]. Muhammad et al. proposed using co-occurrence matrix features combined with the GMM algorithm to classify pathological voices in the SVD database [102]. As shown in Fig. 18, in a more recent study, Muhammad et al. utilized an LSTM to complete the recognition task [109]; they achieved an accuracy of 95% by using a CNN to fuse the spectrogram features of the voice and electroglottograph (EGG) signals. Recasting speech recognition as image recognition allows researchers to borrow solutions from the image-recognition field. However, care is needed with the requirement for strict data alignment and the increase in computation.
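A minimal sketch of the spectrogram-plus-CNN recipe follows, assuming librosa for the log-mel spectrogram and PyTorch for a tiny two-class network; the architecture, input size, and file name are illustrative rather than those of the cited studies.

```python
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=64, frames=128):
    """Fixed-size log-mel spectrogram so a 1-D signal can feed a 2-D CNN."""
    y, sr = librosa.load(path, sr=sr)
    m = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    m = librosa.util.fix_length(m, size=frames, axis=1)  # pad or trim the time axis
    return torch.tensor(m, dtype=torch.float32).unsqueeze(0)  # (1, n_mels, frames)

class SpectrogramCNN(nn.Module):
    """Tiny CNN for binary pathological/healthy classification of spectrograms."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 32, 2)  # 64x128 input after two poolings

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Hypothetical single-utterance forward pass (add a batch dimension).
model = SpectrogramCNN()
logits = model(log_mel("patho_001.wav").unsqueeze(0))
```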
As shown in Fig. 18, information fusion using multimodal data from different systems is also one of the main strategies in speech-based disease diagnosis [102,109]. Fig. 19 shows the identification flow chart of the COVID-19 detection system [138] designed by Ponomarchuk et al., whose inputs are the patients' voice signals and symptom information. First, the speech signal is processed by subsystem 1, based on a deep CNN and spectrograms, and by subsystem 2, based on LightGBM and VGGish features, to obtain ensemble-averaged class probabilities. Next, the symptom information is processed by a Logistic Regression (LR) algorithm in subsystem 3 to obtain class probabilities. Then, the final result is obtained by fusing the weighted output probabilities of the three subsystems. Botha et al. proposed a fused system that combines a classifier based on objective clinical measurements with a classifier based on cough audio using LR, which improved sensitivity, specificity, and accuracy [148].
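The weighted late-fusion step common to these systems can be sketched as follows; the three subsystem outputs and their weights are hypothetical placeholders, not the values used by Ponomarchuk et al.

```python
import numpy as np

def fuse_probabilities(subsystem_probs, weights):
    """Weighted late fusion of class probabilities from several subsystems."""
    probs = np.asarray(subsystem_probs, dtype=float)  # (n_subsystems, n_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalise the weights
    fused = (w[:, None] * probs).sum(axis=0)          # weighted average per class
    return fused / fused.sum()

# Hypothetical outputs of three subsystems (e.g. spectrogram CNN, gradient
# boosting on audio embeddings, logistic regression on symptom metadata),
# each giving [P(negative), P(positive)].
p_cnn, p_gbm, p_lr = [0.30, 0.70], [0.45, 0.55], [0.20, 0.80]
print(fuse_probabilities([p_cnn, p_gbm, p_lr], weights=[0.4, 0.3, 0.3]))
```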
Similarly, Lauraitis et al. used information from three modalities, namely sound, finger tapping, and self-administered cognitive testing, for symptom diagnosis [110]. The authors of Refs. [107,129,152,184] also conducted similar studies with multimodal data, and their systems performed better than systems using only one type of data. The COVID-19 detection system designed by Stasak et al. [131] used speech signals as the only input modality; nevertheless, its classification performance was effectively improved by adding a second-stage classifier that fuses the results of multiple first-stage classifiers. As the number of data modalities increases, the amount of information, computation, and cost also increases, and the demands on processing methods grow. Selecting the few most informative modalities according to doctors' experience may therefore be a precondition for further progress. As shown in Fig. 20, Chowdhury et al. also designed a complex ensemble-based system to detect COVID-19 [139]. The trained classifier layer is composed of 10 ML classifiers, which are ranked by the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) and entropy weighting in the Multi-Criteria Decision-Making blocks. Finally, the features selected by Recursive Feature Elimination with Cross-Validation are fed into the best classifier. This method improves the diagnostic accuracy and adaptability of the whole system.
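To illustrate the feature-selection stage named above, the sketch below runs scikit-learn's Recursive Feature Elimination with Cross-Validation on synthetic stand-in features; the estimator, step size, and scoring metric are assumptions, not Chowdhury et al.'s settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a table of utterance-level acoustic descriptors.
X, y = make_classification(n_samples=200, n_features=60, n_informative=10,
                           random_state=0)

# Recursive Feature Elimination with Cross-Validation keeps only the
# features that actually help the downstream classifier.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=5,
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
)
selector.fit(X, y)
print("selected features:", selector.n_features_)
X_reduced = selector.transform(X)  # would then be fed to the best-ranked classifier
```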
In addition to the above methods, some new attempts at pathological voice recognition exist.
Pahar et al. adopted speech, cough, and breath signals together with less common bottleneck features for pathological voice recognition [141]. In addition, they utilized transfer learning, first training the model on cough sounds from patients without COVID-19. The recognition system was then tested with multiple pathological voice datasets and multiple classifiers, which verified the feasibility of this scheme [141]. Later, they adopted three DL classifiers, ResNet50, CNN, and LSTM, to classify TB, COVID-19, and healthy coughs. Finally, to make DL-based approaches achieve excellent performance and robustness, they adopted the synthetic minority over-sampling technique (SMOTE) and transfer learning to address class imbalance and data insufficiency, respectively [149].
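A minimal sketch of the over-sampling step mentioned above uses the SMOTE implementation from imbalanced-learn on a synthetic, imbalanced feature matrix; the class ratio and feature dimensionality are illustrative only.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced cough-sound feature matrix
# (far fewer positive recordings than negative ones).
X, y = make_classification(n_samples=500, n_features=40, weights=[0.9, 0.1],
                           random_state=0)
print("before:", Counter(y))

# SMOTE synthesises minority-class examples by interpolating between
# neighbouring minority samples in feature space.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```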
Moreover, Harimoorthy et al. proposed an adaptive linear-kernel SVM algorithm with higher prediction accuracy than traditional ML algorithms such as KNN, Random Forest (RF), Adaptive Weighted Probabilistic classifiers, and other kernel SVMs [185]. Kambhampati et al. proposed a fundamental heart-sound segmentation algorithm based on sparse signal decomposition. They tested the algorithm's performance with various ML algorithms (hidden semi-Markov model, multilayer perceptron (MLP), SVM, and KNN) on real-time phonocardiogram (PCG) recordings and on PCG recordings from a standard database. The results showed that their algorithm outperformed traditional heart-sound segmentation algorithms [186].
Furthermore, Saeedi et al. used a genetic algorithm to find the filter-bank parameters for feature extraction and achieved an accuracy of 100% in classifying normal and pathological voices in tests on two databases [97,98]. Qian et al. [134] and Huang et al. [187] tried popular end-to-end models and Transformer-based models for pathological speech processing; their recognition results were highly consistent with the patients' evaluation scales. Fig. 21 shows the framework of the pathological speech-based diagnosis designed by Wahengbam et al. [183]. First, a deep pathological denoiser (DPD) block is trained on silence and noise features with a CNN and applies an inverse STFT to convert the denoised spectrum back to the time domain; the DPD block is the first step of the group decision analogy. Then, three kinds of features of the denoised pathological speech, obtained from Amor, Bump, and Morse wavelet transforms, are sent to three decision-making subsystems, each of which uses multiple 3D convolutional network models for prediction. Finally, fusion and decision-making are performed using the proposed group decision analogy strategy, and the accuracy increased from 80.59% to 97.7% [183].
The studies mentioned above have made innovations throughout the speech-technology pipeline and offer many insights. Innovations in data include using different types of voice signals, integrating data of multiple modalities such as SVs, continuous speech, cough, breath, finger tapping, EGG, and disease symptoms, and utilizing transfer learning to train models despite insufficient data. For feature selection, these studies try genetic algorithms, DL algorithms, recursive feature elimination methods, fractal-dimension approaches, etc. In terms of classifiers, in addition to improved ML methods, these studies also try group decision strategies, end-to-end models, etc.
5.4. Summary and discussion
SV and SS are the two main types of speech signals used for pathological speech-based disease diagnosis and evaluation. In addition, cough is considered indispensable in diagnosing respiratory diseases and is usually processed in the same way as an SV. The SV method mainly uses abnormalities in the patients' pronunciation as the basis for judgment, which makes the experiments and applications relatively less complicated. However, the SV method ignores the sentence-level differences between patients' speech and that of healthy people. On the other hand, SS uses the entire sentence as the judgment basis and can more accurately identify obviously abnormal speech, although model training and the overall procedure are relatively complicated.
In addition, Bhosale et al. [188] and Casanova et al. [189] also used cough sounds to diagnose COVID-19, and Gosztolya et al. utilized SS to distinguish schizophrenia from bipolar disorder [190]. Whatever type of speech data is adopted, researchers seek more effective signal features for disease diagnosis and evaluation and choose the recognition algorithm that works best in practice [15].
Using voice technology for disease diagnosis and assessment can effectively reduce the burden on doctors and improve the efficiency of medical resources. Traditional diagnosis methods rely on medical instruments combined with doctors' experience, whereas speech technology depends only on the patient's speech and a pre-trained algorithm model that incorporates medical experience, which is more objective [192]. In addition, combining pathological voice recognition with the IoT [103], telemedicine, and other technologies [193] allows patients to be diagnosed anytime and anywhere, reducing medical costs dramatically. Pathological voice recognition can also be integrated into wearable devices to monitor patients' health during daily activities [194,195], enabling early diagnosis and preventing deterioration [196].
In the future, to achieve better diagnosis and evaluation results, in addition to exploring more effective features and recognition algorithms, it is crucial to design multimodal data fusion methods [115,184,191,197,198] and build richer pathological voice datasets.
6. Speech recognition for human-medical equipment interaction
Doctors need to operate various equipment in their work. In addition, patients often require equipment to assist in treatment and rehabilitation. Integrating voice technology into medical equipment can bring great convenience to doctors and patients in many medical scenarios [199]. For example, smart medicine boxes remind patients to take medicines on time, intelligent ward round systems help doctors collect patient information [200], and voice systems perform automatic postoperative follow-up visits [201,202]. This section discusses related studies on medical device control using IST and how they can help doctors and patients in different scenarios. Finally, we discuss the requirements and future directions for the application of voice technology in smart medical equipment and devices.
6.1. Doctor and patient assistance
6.1.1. Doctor assistance
With the rapid development of medical speech technology, many studies have attempted to use it to assist doctors in operating equipment. For example, intelligent minimally invasive surgical systems have been put into clinical use, and doctors can control the robotic arm to perform precise operations through voice [208]. Ren et al. tried to embed speech recognition in the laparoscopic holder [209]. The holder with the speech command recognition function can replace the assistant and give corresponding feedback according to the instructions of the chief surgeon [209]. In addition, Tao et al. proposed an intelligent interactive operating room to solve the problem that the attending doctor must be in a sterile and non-contact environment and cannot view the lesion image in time during the operation [210]. The doctor can remotely control the display instrument using speech commands to locate and observe the image of the lesion quickly.
Furthermore, as shown in Fig. 22, Yoo et al. presented an intelligent voice assistant to address the problem that surgeons continuously need an assistant to check information during surgery [211]. The voice assistant can recognize the attending doctor's proofreading speech and compare it with the pre-entered surgical information to ensure the smooth progress of the operation. It can also remind the attending doctor of the elapsed operation time.
All these studies use IST to reduce the burden of inefficient labor on doctors and make the medical process more standardized and efficient.
6.1.2. Patient assistance
In addition to using voice technology to assist doctors, many studies embed it in assistive devices to give patients a better quality of life. For example, intelligent wheelchairs integrated with voice technology have been studied extensively. Li et al. designed a voice-controlled intelligent wheelchair that determines specific commands by comparing distances between characteristic parameters [212]. As shown in Fig. 23(a), Atrash et al. added a computer, a display, a laser rangefinder, and an odometer to a wheelchair to realize autonomous navigation according to voice commands [203]. Al-Rousan et al. realized movement-direction control of an electric wheelchair using voice command recognition based on wavelets and neural networks [213]. Wang et al. developed an intelligent wheelchair that combines a brain-computer interface and speech recognition for coordinated control by patients with mental illness and dysarthria [214].
Moreover, as shown in Fig. 23(b), Almutairi et al. proposed smart glasses that can navigate visually impaired patients to destinations based on the Global Positioning System, the Global System for Mobile communication, Google Maps, and speech recognition [204]. The smart glasses designed by Punith et al. can also help people with visual disabilities read printed notes using Optical Character Recognition and Text-to-Speech technology [215].
Many studies also focus on using speech recognition technology as a communication method for deaf patients or patients with speech disorders. For instance, Jothi et al. proposed a knowledge-based system to analyze the unstructured words pronounced by the patient and transform them into meaningful text [216]. Balaji et al. attempted to help dysarthric persons overcome difficulties in interacting with others by mapping their distorted speech to normal or less severe dysarthric speech [217]. As shown in Fig. 23(c), Lee et al. designed an assistive agent system to help hard-of-hearing people understand others: when the patient talks with others, the assistive device uses IST to transcribe other people's speech into text and uses speech synthesis to convert the patient's text into speech, helping the patient communicate normally [205,218].
Furthermore, Fontan et al. found experimentally that speech technology can be used to improve hearing-aid gain settings and maximize speech intelligibility and hearing comfort [206]. Akbarzadeh et al. employed reinforcement learning to personalize the compression settings of hearing aids and avoid loudness discomfort [219]. In addition, as shown in Fig. 23(d), LAPUL utilizes voice technology to make its preconditioning trainer easy to use [207]. All these studies use IST to help patients in daily life and overcome problems caused by diseases, which is conducive to their recovery.
6.2. Medical process optimization
In addition to assisting doctors and patients, IST will also bring changes to healthcare delivery. Recently, COVID-19 has highlighted the importance of non-contact healthcare during a pandemic, and telemedicine is an essential application of IST in healthcare [220]. For example, Amazon's Alexa medical dialogue systems have been widely used [221], and Nuance has developed mature telemedicine solutions [222]. Liu et al. designed a low-cost cognitive tool to help children with autism exercise communication and color-cognition skills [223]. As shown in Fig. 24, Bu et al. utilized virtual reality technology to design a system with functions of oral expression, auditory comprehension, cognition, and comprehensive application to help post-stroke aphasia patients perform rehabilitation training [224]. They conducted a clinical trial with the system, and the subjects affirmed its effect on language-skill rehabilitation training. In addition, Jokić et al. proposed a contact-free cougher recognition approach using smartphone audio recordings and metric learning [153]. Pahar et al. also designed cough-spotting and cougher-identification methods for long-term personalized cough monitoring and proposed an automatic, non-invasive cough detection method based on the audio and acceleration signals of a smartphone [225,226]. These non-contact cougher identification methods are helpful during the COVID-19 pandemic and promote the development of IST-based health-monitoring technology.
Patient care is also a vital application scenario [231,232]. For example, Olami developed a smart speech-based hospital bed card [227]. Doctors can use the card to enter, manage, and conveniently read patient information, and patients can use the card to communicate with nursing stations easily. There are also studies using robots for patient care. With the help of speech technology, nursing robots can meet patients' needs according to their instructions [233]. Zorarobotics designed an intelligent healthcare robot to help the elderly fight loneliness and cognitive decline [229]. As shown in Fig. 25(a), Zhang et al. designed the Pepper rehabilitation medical robot for patients with cognitive and motor-function decline; this robot can interact with patients to help them practice language skills and remind them to take medicine [230]. Some studies have also been applied in traditional medical scenarios, such as guide robots for patient admission consultation [234]. Fig. 25(b) shows the guidance robot from Shen Zhou Yun Hai [228], which can provide consultation and guidance services for patients seeking medical treatment and reduce the burden on the consultation desk.
Optimizing the medical process makes the medical system more intelligent and efficient. Work that requires less specialized medical expertise can be entrusted to intelligent devices, and the medical process can be improved through IST.
6.3. Summary and discussion
Embedding speech recognition, interaction, and synthesis into devices can make them smarter. However, due to the particularity of medical scenarios, research on speech recognition for intelligent equipment needs to pay attention to the following issues:
- a) A low misrecognition rate should be regarded as a prerequisite of the speech recognition system because misrecognition carries a high cost in medical scenarios.
- b) Methods for rejecting the voice of a non-device user must also be adopted to ensure a low misrecognition rate.
- c) The missed recognition rate will seriously affect the user experience.
Technically, speaker-dependent recognition can be used to ensure a low misrecognition rate. Speaker identification techniques [225] should be utilized to monitor the speaker who controls the equipment through voice commands in a noisy environment. In addition, the IST utilized for assisted control of medical equipment should ensure the highest priority of human intervention and the safety of medical operations.
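One way to realize such speaker-gated command acceptance is sketched below, using a crude mean-MFCC "embedding" and a cosine-similarity threshold; a deployed system would use a proper speaker-verification model, and the enrollment file, command file, and threshold here are hypothetical.

```python
import numpy as np
import librosa

def speaker_embedding(path, sr=16000):
    """Crude speaker representation (mean MFCC vector); a real deployment
    would use a dedicated speaker-verification model instead."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

def accept_command(enrolled_emb, utterance_path, threshold=0.85):
    """Pass a voice command to the device only if it matches the enrolled
    operator; otherwise reject it to keep the misrecognition rate low."""
    emb = speaker_embedding(utterance_path)
    cos = np.dot(enrolled_emb, emb) / (np.linalg.norm(enrolled_emb) * np.linalg.norm(emb))
    return cos >= threshold  # threshold is illustrative and tuned per device

# Hypothetical usage during equipment control:
# enrolled = speaker_embedding("surgeon_enrollment.wav")
# if accept_command(enrolled, "command_utterance.wav"):
#     execute_command()
```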
In the future, smart medical devices can upload patient voices to the hospital's private cloud through IoT technology, and the voice recognition models can use cloud computing to overcome the influence of different scenarios and language styles and improve recognition performance. In addition, Extended Reality technology integrated with voice technology will make telemedicine more realistic, effective, and acceptable [235].
7. Case study of intelligent speech technologies for stroke
Based on the above reviews of IST in the three medical scenarios, we conducted a case study on stroke recognition and rehabilitation assistance. We propose an IST application framework for stroke patients and carried out speech data collection and recognition experiments.
7.1. Speech technology for stroke patients
The medical system will become more intelligent with the development of IST, which enables smart hospitals to improve the efficiency of disease diagnosis, evaluation, surgery, and rehabilitation training. As depicted in Fig. 26, we take the treatment process of a stroke patient as an example to introduce the application of IST. IST combined with 5G communication technology connects hospitals and patients. Furthermore, early symptoms of stroke can be recognized by wearable devices integrated with speech recognition technology, such as smart wristbands, smartphones, smart glasses, and home smart monitoring devices. These smart devices can give an early warning and make an emergency call after recognizing the symptoms.
In addition, speech-based medical transcription systems in ambulances and the emergency department can help record the entire treatment process and complete the documentation of the patient's information. Moreover, in the operating room, medical equipment that understands doctors' voice commands can help them view lesion images, proofread surgical information, and make records and reminders in real time, improving the standardization and efficiency of the surgical process. Furthermore, rehabilitation training is a vital treatment stage for stroke patients. Patients can control smart wheelchairs, rehabilitation robots, and other equipment by voice commands to support rehabilitation training and daily activities at the hospital or at home, improving their quality of life. At the same time, voice technology can be used to evaluate the rehabilitation effect quickly and record it in the patient's EMR system to help doctors adjust rehabilitation training strategies.
7.2. Data collection and speech recognition experiment
In this work, we conducted a pathological speech recognition experiment on stroke patients. A data collection system was developed for the pathological voice collection of stroke patients. The experimental protocol was approved by the Medical Research Ethics Committee of Guangdong Provincial People's Hospital (approval number: KY-Z-2021-431-02). Stroke patients and healthy people read the sentence "People's Republic of China" in the data collection experiments. The recorded audio was stored as .wav files with a sampling frequency of 16 kHz and a sampling accuracy of 16 bits. The hardware of the data collection system was a laptop running the Ubuntu 20.04 operating system and a Hikvision portable speakerphone microphone (DS-65VA300B). Table 5 shows the dataset, comprising 101 pathological recordings of stroke patients (male: 74, female: 27) and 101 recordings of healthy people (male: 49, female: 52), used to validate stroke speech recognition.
Table 5.
| Subject | Male | Female | Total |
| --- | --- | --- | --- |
| Stroke Patients | 74 | 27 | 101 |
| Healthy People | 49 | 52 | 101 |
The stroke speech recognition experiment was conducted under Anaconda. The hardware platform had an Intel(R) i7-9700 CPU and 16 GB of memory; the software environment was Python 3.7.13, librosa 0.9.0, and scikit-learn 0.24.2. A 22-dimensional feature vector consisting of energy, perturbation, rhythm, 18-dimensional MFCCs, and fundamental frequency was extracted with librosa.
Finally, the features were used for five-fold cross-validation with RF, KNN, linear-kernel SVM, and Naïve Bayes classifiers. Table 6 shows the results; TP, FN, TN, FP, ACC, and F1 are averages over the five folds. The RF algorithm achieves the best F1 score and a recognition accuracy of 0.87, and correspondingly its prediction time is the longest at about 55.4 ms, which is still acceptable for practical applications. The results verify the feasibility of using speech technology to recognize the voice of stroke patients.
Table 6.
| Recognition algorithm | TP | FN | TN | FP | ACC | F1 | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RF | 17.4 | 2.6 | 17.6 | 2.8 | 0.87 | 0.87 | 55.4 |
| KNN | 19.6 | 6.2 | 14 | 0.6 | 0.83 | 0.85 | 10.6 |
| SVM (Linear Kernel) | 15.2 | 6.2 | 14 | 5 | 0.72 | 0.73 | 1.5 |
| Naïve Bayes | 14.2 | 5.2 | 15 | 6 | 0.72 | 0.72 | <1 |
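A minimal sketch of a pipeline comparable to the experiment in Section 7.2 is given below, assuming librosa features and scikit-learn five-fold cross-validation; the feature recipe only loosely mirrors the 22-dimensional set described above (the perturbation and rhythm features are not reproduced), and the file lists are placeholders rather than the actual dataset.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def utterance_features(path, sr=16000):
    """Compact 22-dimensional utterance vector: 18 MFCC means, mean energy,
    and three fundamental-frequency statistics."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=18).mean(axis=1)        # 18 dims
    energy = np.array([np.mean(librosa.feature.rms(y=y))])                 # 1 dim
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    f0_stats = np.array([np.nanmean(f0), np.nanstd(f0), np.nanmax(f0)])    # 3 dims
    return np.concatenate([mfcc, energy, f0_stats])

# Placeholder file lists and labels (1 = stroke patient, 0 = healthy speaker);
# the study used 101 recordings per class, as in Table 5.
files = [f"stroke_{i:03d}.wav" for i in range(101)] + \
        [f"healthy_{i:03d}.wav" for i in range(101)]
labels = [1] * 101 + [0] * 101

X = np.vstack([utterance_features(f) for f in files])
scores = cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=5)
print("mean five-fold accuracy:", scores.mean())
```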
7.3. Voice assistants for stroke diagnosis
Voice assistants, such as voice-technology-based applications running on smartphones, can be used at home, in hospitals, and in community clinics. They are helpful for the early recognition of stroke at home, where patients can perform self-diagnosis or be assisted by family members. In community clinics with limited examination equipment, voice assistants can use recognition models trained on diagnostic data from large hospitals to better assist doctors in stroke diagnosis, which realizes the sharing of medical resources. Moreover, advanced voice assistants are also valuable in large hospitals. For example, with the development of AI technologies and the enrichment of speech datasets, voice assistants with more intelligent speech recognition algorithms can better adapt to complex environments and convert accumulated data into accumulated diagnostic experience. In some cases, these voice assistants could achieve higher recognition accuracy than doctors and assist them in achieving the goal of smart hospitals.
8. Limitations and future directions
8.1. Limitations
Nowadays, speech technology is essential to traditional healthcare methods and systems. However, the application of speech technology in the medical system faces more challenges and needs to be continuously improved in the future. Therefore, we summarize some common issues in this section as potential future research directions.
8.1.1. Low adaptability and robustness
Most research on medical solutions based on voice technology is still conducted under ideal conditions. However, actual medical scenarios are more complicated and contain more background noise, such as doctors' conversations and equipment beeping. Moreover, pronunciation differences among doctors and the mixing of recognition results from multiple speakers are all potential factors that degrade the performance of a speech recognition system. Medical application scenarios place high requirements on the adaptability and stability of the system, which is a considerable challenge for speech-based medical solutions.
8.1.2. Lacking high-quality pathological speech datasets
Although IST performs excellently in some medical scenarios, pathological speech research lacks high-quality data for disease diagnosis and patient assessment. Speech technologies do not transfer readily between languages, which slows the study of pathological speech. Moreover, pathological speech datasets are rarer still because of patient discomfort and the difficulty of speech data collection. The existing open-source pathological speech datasets are few, small, and collected under different criteria. These limitations make it challenging to train high-accuracy recognition models with DL methods.
8.1.3. Medical voice data privacy protection
In the digital age, IST brings ethical and legal issues while playing an essential role in medical scenarios. Speech data may be stolen or used to extract personal information, such as users' identities, emotions, and other private attributes. With the development of digital medical solutions, this problem will become prominent and challenge the popularization of speech-based medical solutions. Joint efforts on both legal and technical fronts are needed to protect the private speech data of patients and doctors, and new privacy protection policies and techniques should be investigated comprehensively for the collection, transmission, storage, sharing, and use of medical speech data.
8.1.4. Conflict with traditional solutions
Applying IST in medical scenarios will inevitably change some traditional medical practices. The resulting resistance from medical staff and the question of patient acceptance are issues that need to be considered and resolved to promote the application of IST in smart hospitals.
8.1.5. Lacking unified and effective evaluation methods
The application of IST in medical scenarios must introduce the performance indicators that matter in medical practice rather than evaluating the application only by speech-technology metrics. In the three scenarios described in this paper, medical transcription should focus on improving efficiency. Pathological speech recognition should emphasize the correct diagnosis rate and reduce misdiagnosis so that patients' diseases are diagnosed safely. In medical equipment control, attention must be paid to the recognition effect in the actual scenario because of its complexity; while emphasizing the recognition rate, the recognition time, rejection rate, and rejection-recognition rate also need to be considered. Therefore, speech-based medical solutions should be evaluated by their effectiveness in real-world medical scenarios, focusing on the actual experiences of doctors and patients.
8.2. Future directions
Speech technology has great potential in the field of healthcare. With the progress of research and clinical application testing, traditional healthcare systems will be reshaped. For example, as shown in Fig. 27, with the help of IoT technology, smart terminal devices such as microphone arrays, wearable devices, and medical assistive equipment can be connected to provide users with intelligent speech services such as transcription, diagnosis, and interaction. Meanwhile, these smart devices upload users' speech data into a database for storage over the network and use cloud computing technology to organize and analyze the speech to improve recognition accuracy further.
As a result, doctors and patients can utilize smart devices for initial diagnosis and telemedicine to improve the accessibility and quality of healthcare [236]. Intelligent sensors, devices, and robots can assist in monitoring patients and rehabilitation. Doctors can view patients’ EMR and historical data collected by the sensors to analyze their health status systematically with speech-based auxiliary diagnosis systems. The intelligence of medical devices ensures the efficiency and standardization of surgical procedures. Researchers can use more comprehensive medical speech data to continuously optimize the recognition methods and improve the performance of the medical speech recognition systems. We can make efforts in the following four aspects to promote the development of smart hospitals and smart medicine to better serve patients, doctors, and researchers.
8.2.1. Improve the adaptability of IST
As illustrated in Fig. 27, improving the adaptability of speech processing-based smart equipment in different medical scenarios is a top priority because most current medical speech recognition systems are vulnerable to noise interference in practical application scenarios [19,20,183]. We can use a multi-channel microphone array and noise reduction algorithm to improve the quality of collected sound. Therefore, speech noise reduction algorithms and sound source identification algorithms related to microphone arrays [20] are popular research directions. Moreover, designing lightweight deep learning models [237] and selecting more critical features [147] will improve the real-time performance, adaptability, and robustness of the voice system in some medical scenarios, such as wearable devices-based patient monitoring. Furthermore, the ability to automatically switch personalized identification models based on the beeping of medical devices in different scenarios or patient-doctor interaction information is also worth exploring.
8.2.2. Explore the potential of multimodal signals and DL methods
DL accounts for only a small part of current pathological voice recognition methods because large amounts of data are unavailable [92,93,101,104,110,119,124,129,145,146]. However, researchers can continuously explore the potential of DL methods such as attention-based LSTM [45,47], end-to-end models [134], and Transformer models [187] and try advanced recognition algorithms to improve the performance of IST for medical applications. Moreover, fusing voice signals with signals of other modalities, such as electroglottograph (EGG) signals, EMRs, X-ray images, and ultrasound [4,5], will be even more valuable for disease diagnosis in smart hospitals. For example, combining chest X-ray image-based and cough sound-based non-contact COVID-19 classification methods could help minimize severity and mortality rates during the pandemic [5,6,188,189]. Furthermore, algorithms from other domains, such as AlexNet, VGGNet, GoogLeNet, and ResNet in image recognition, can be applied to speech spectrograms. In addition, it is worth studying how to extract more effective speech signal features using DL with the help of medical experts and how to use few-shot learning methods to deal with the limited amount of pathological speech data.
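As a sketch of reusing image-recognition backbones on spectrograms, the snippet below adapts an ImageNet-pretrained ResNet-18 to a two-class pathological-voice task with PyTorch and torchvision; the input shapes, frozen layers, and the `weights` string (torchvision >= 0.13) are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 and replace its final layer so it
# outputs two classes (pathological vs. healthy).
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)

# Spectrograms are single-channel, so repeat them to the three channels the
# pretrained backbone expects; shapes here are illustrative.
spec = torch.randn(8, 1, 224, 224)           # batch of log-mel spectrograms
logits = model(spec.repeat(1, 3, 1, 1))      # (8, 2) class scores

# With few labelled pathological recordings, freeze most layers and fine-tune
# only the last residual block and the new classification head.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("layer4", "fc"))
```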
8.2.3. Integrate with emerging technologies
As illustrated in Fig. 27, state-of-the-art information technologies such as IoT, 5G communication, cloud computing, virtual reality, and blockchain can facilitate voice-based medical solutions in smart hospitals and healthcare. IoT can provide solutions for distributed data collection and real-time monitoring. Audio coding, 5G, and cloud computing can reduce data transmission latency and computation delay, deal with big-data issues, and drive telemedicine services forward [10]. Virtual reality technology can make voice-assisted diagnosis and rehabilitation systems more fun and increase patient engagement. In addition, blockchain technology can protect users’ privacy, facilitate the sharing of medical voice data, and promote the creation of open-source and high-quality voice datasets. Moreover, the voice analysis system should have self-learning and automatic optimization capabilities to obtain more intelligent and accurate recognition performance.
As shown in Fig. 28, we propose a novel medical voice analysis system architecture based on active perception. With active hardware, active software, and human-computer interaction, this framework realizes the active data collection and recognition of medical speech, as well as the closed-loop optimization of the recognition model to improve the intelligence of the medical system. Furthermore, the framework integrates knowledge reasoning and self-learning into speech-based systems, promoting the evolution of more powerful voice assistants.
8.2.4. Think from the points of view of doctors and patients
When applying speech technology to healthcare, it is necessary to clarify its auxiliary role because speech recognition involves uncertainties such as noise interference, language differences, and pronunciation differences. We also need to think about problems from the perspective of patients and doctors; for example, a patient with multiple diseases should be treated by doctors first. At the same time, IST combined with image processing [6] and other technologies can assist doctors in their work to address these uncertainties and improve the robustness and acceptability of AI-based medical systems. In most cases, with techniques such as semantic reasoning and knowledge mapping, machines can better understand the speaking intentions of doctors and patients. Meanwhile, a friendlier and ergonomically sound voice assistant system can be designed for the specific application scenario to reduce the difficulty of using speech technology. In the face of resistance from traditional healthcare, speech technologies with customer-oriented, user-friendly interfaces and multimodal human-computer interaction capabilities will be more persuasive and welcome.
Moreover, the methods for assessing the effectiveness of IST should be designed by integrating the objective results from transcription, disease diagnosis, and medical equipment control, as well as the questionnaire results of the subjective experience of doctors and patients. Furthermore, in addition to doctors, patients, and scientists, the government, medical industry, and legal departments need to work together to build standardized application procedures and assessment systems for IST used in smart hospitals to alleviate the shortage and imbalance of medical resources.
9. Conclusion
The COVID-19 pandemic has made us realize that the traditional medical system struggles to provide high-quality care due to a lack of staff and other medical resources. IST has brought unprecedented opportunities for health systems to address this issue. This paper first comprehensively reviews the application of IST in smart hospitals, including electronic medical document transcription, pathological voice recognition, and medical process optimization through human-medical equipment interaction. Then, we discuss how a speech-based healthcare system facilitates the early recognition, rehabilitation assistance, and intelligent assessment of stroke patients and introduce the diagnosis results of 101 stroke patients using their pathological speech data. The literature review shows that the study of IST in medical scenarios has attracted more and more scholars’ attention and achieved promising results. State-of-the-art AI models, such as models based on Attention or Transformer, are applied to speech recognition. Moreover, the multimodal fusion of speech and other signals improves recognition accuracy and system robustness. However, these results are mainly from pilot projects or small datasets. Therefore, adequate research and validations are needed before clinical applications.
Furthermore, we discuss some limitations to the development of IST in the medical field, such as the scarcity of available high-quality datasets, privacy issues, and lack of unified and effective evaluation methods. Finally, we present some future directions for medical speech technology. We also propose a novel active perception concept-based medical voice analysis system architecture, which employs active hardware, active software, and human-computer interaction to realize an intelligent and evolvable speech recognition system for smart hospitals.
The comprehensive review of the applications in smart hospitals provides helpful information for researchers on this topic. In addition, the summarized limitations and proposed future directions could give inspiration for future studies. Moreover, the case study of IST for stroke gives a reference for a full-process application of IST in various medical behaviors. Furthermore, the proposed active perception concept and the speech analysis system architecture can advance the IST applications in smart hospitals and offer an opportunity to apply IST in other scenarios with noise interference, such as airports, railway stations, and shopping malls.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported in part by the National Key R&D Program of China under Grant 2020AAA0109603, the National Natural Science Foundation of China under Grants 61873066 and 62173090, and the Zhi Shan Scholars Program of Southeast University under Grant 2242020R40096. We would like to thank the Reviewers for taking the time and effort necessary to review the manuscript. We sincerely appreciate all valuable comments and suggestions that help us improve the quality of the manuscript.
References
- 1. World Health Organization. 10 Facts on Ageing and Health. 2017. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/10-facts-on-ageing-and-health [Google Scholar]
- 2. World Health Organization. Decade of Healthy Ageing: Baseline Report. 2020. [Online]. Available: https://www.who.int/publications/i/item/9789240017900 [Google Scholar]
- 3.Turkmen H.I., Karsligil M.E. Advanced computing solutions for analysis of laryngeal disorders. Med. Biol. Eng. Comput. 2019;57:2535–2552. doi: 10.1007/s11517-019-02031-9. [DOI] [PubMed] [Google Scholar]
- 4.Bhosale Y.H., Patnaik K.S. Application of deep learning techniques in diagnosis of Covid-19 (coronavirus): a systematic review. Neural Process. Lett. 2022:1–53. doi: 10.1007/s11063-022-11023-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Bhosale Y.H., Zanwar S., Ahmed Z., Nakrani M., Bhuyar D., Shinde U. Deep convolutional neural network based Covid-19 classification from radiology X-Ray images for IoT enabled devices. Int. Conf. Adv. Comput. Commun. Syst. 2022:1398–1402. [Google Scholar]
- 6.Bhosale Y.H., Patnaik K.S., PulDi-Covid Chronic obstructive pulmonary (lung) diseases with COVID-19 classification using ensemble deep convolutional neural network from chest X-ray images to minimize severity and mortality rates. Biomed. Signal Proces. 2023;81 doi: 10.1016/j.bspc.2022.104445. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Sajid M., Shafique T., Baig M.J.A., Riaz I., Amin S., Manzoor S. Automatic grading of palsy using asymmetrical facial features: a study complemented by new solutions. Symmetry. 2018;10:242. [Google Scholar]
- 8.Guo Z., Shen M., Duan L., Zhou Y., Xiang J., Ding H., Chen S., Deussen O., Dan G. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. 2017. Deep assessment process: objective assessment process for unilateral peripheral facial paralysis via deep convolutional neural network; pp. 135–138. [Google Scholar]
- 9.Kanhirakadavath M.R., Chandran M.S.M. Investigation of eye-tracking scan path as a biomarker for autism screening using machine learning algorithms. Diagnostics. 2022;12:518. doi: 10.3390/diagnostics12020518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Latif S., Qadir J., Qayyum A., Usama M., Younis S. Speech technology for healthcare: opportunities, challenges, and state of the art. IEEE Rev. Biomed. Eng. 2021;14:342–356. doi: 10.1109/RBME.2020.3006860. [DOI] [PubMed] [Google Scholar]
- 11.Chiu C.C., Sainath T.N., Wu Y., Prabhavalkar R., Nguyen P., Chen Z., Kannan A., Weiss R.J., Rao K., Gonina E., et al. IEEE Int Conf Acoust Speech Signal Process Proc. IEEE; 2018. State-of-the-art speech recognition with sequence-to-sequence models; pp. 4774–4778. [Google Scholar]
- 12.Herff C., Schultz T. Automatic speech recognition from neural signals: a focused review. Front. Neurosci. 2016;10:429. doi: 10.3389/fnins.2016.00429. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Poder T.G., Fisette J.-F., Dery V. Speech recognition for medical dictation: overview in Quebec and systematic review. J. Med. Syst. 2018;42:89. doi: 10.1007/s10916-018-0947-0. [DOI] [PubMed] [Google Scholar]
- 14.Blackley S.V., Huynh J., Wang L., Korach Z., Zhou L. Speech recognition for clinical documentation from 1990 to 2018: a systematic review. J. Am. Med. Inf. Assoc. 2019;26:324–338. doi: 10.1093/jamia/ocy179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Mamatha V.L. A systematic review of machine learning based automatic speech assessment system to evaluate speech impairment. Int. Conf. Intell. Sustain. Syst. 2020:175–185. [Google Scholar]
- 16.Jamal N., Shanta S., Mahmud F., Sha’abani M.N.A.H. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: a Review. AIP Conf. Proc. 1883;2017 [Google Scholar]
- 17.Johnson M., Lapkin S., Long V., Sanchez P., Suominen H., Basilakis J., Dawson L. A systematic review of speech recognition technology in health care. Bmc. Med. Inform. Decis. 2014;14:94. doi: 10.1186/1472-6947-14-94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Muhammad G. Automatic speech recognition using interlaced derivative pattern for cloud based healthcare system. Cluster Comput. 2015;18:795–802. [Google Scholar]
- 19.Ishfaque A., Kim B. Fly Ormia Ochracea inspired MEMS directional microphone: a review. IEEE Sensor. J. 2018;18:1778–1789. [Google Scholar]
- 20.Rahaman A., Kim B. Microscale devices for biomimetic sound source localization: a review. J. Microelectromech. Syst. 2022;31:9–18. [Google Scholar]
- 21.Ahmad A.M., Ismail S., Samaon D.F. IEEE Int. Symp. Commun. Inf. Technol. . IEEE; 2004. Recurrent neural network with backpropagation through time for speech recognition; pp. 98–102. [Google Scholar]
- 22.Keerio A., Mitra B.K., Birch P., Young R., Chatwin C. On preprocessing of speech signals. Int. J. Signal Process. 2009;5:216–222. [Google Scholar]
- 23.Al-Alaoui M.A., Al-Kanj L., Azar J., Yaacoub E. Speech recognition using artificial neural networks and hidden Markov models. IEEE Multidiscip. Eng. Educ. Mag. 2008;3:77–86. [Google Scholar]
- 24.Othman A.M., Riadh M.H. Speech recognition using scaly neural networks. Int. J. Electr. Comput. Eng. 2008;2:211–216. [Google Scholar]
- 25.Petrik S., Drexel C., Fessler L., Jancsary J., Klein A., Kubin G., Matiasek J., Pernkopf F., Trost H. Semantic and phonetic automatic reconstruction of medical dictations. Comput. Speech Lang. 2011;25:363–385. [Google Scholar]
- 26.Alim S.A., Rashid N.K.A. From Natural to Artificial Intelligence - Algorithms and Applications; IntechOpen London, UK: 2018. Some Commonly Used Speech Feature Extraction Algorithms. [Google Scholar]
- 27.Chehrehsa S., Moir T.J. Speech enhancement using maximum A-posteriori and Gaussian mixture models for speech and noise periodogram estimation. Comput. Speech Lang. 2016;36:58–71. [Google Scholar]
- 28.Frigieri E.P., Campos P.H.S., Paiva A.P., Balestrassi P.P., Ferreira J.R., Ynoguti C.A. A mel-frequency cepstral coefficient-based approach for surface roughness diagnosis in hard turning using acoustic signals and Gaussian mixture models. Appl. Acoust. 2016;113:230–237. [Google Scholar]
- 29.Kumari R.S.S., Nidhyananthan S.S., G A. Fused Mel feature sets based text-independent speaker identification using Gaussian mixture model. Procedia Eng. 2012;30:319–326. [Google Scholar]
- 30.Ghoniem R.M., Shaalan K. A novel Arabic text-independent speaker verification system based on fuzzy hidden markov model. Procedia Comput. Sci. 2017;117:274–286. [Google Scholar]
- 31.Shahin I. Novel third-order hidden Markov models for speaker identification in shouted talking environments. Eng. Appl. Artif. Intell. 2014;35:316–323. [Google Scholar]
- 32.Zeinali H., Sameti H., Burget L., Černocký J.H. Text-dependent speaker verification based on i-vectors, neural networks and hidden markov models. Comput. Speech Lang. 2017;46:53–71. [Google Scholar]
- 33.Bryan J.D., Levinson S.E. Autoregressive hidden markov model and the speech signal. Procedia Comput. Sci. 2015;61:328–333. [Google Scholar]
- 34.Papandrea P.J., Frigieri E.P., Maia P.R., Oliveira L.G., Paiva A.P. Surface roughness diagnosis in hard turning using acoustic signals and support vector machine: a PCA-based approach. Appl. Acoust. 2020;159 [Google Scholar]
- 35.Das B.R., Sahoo S., Panda C.S., Patnaik S. Part of speech tagging in Odia using support vector machine. Procedia Comput. Sci. 2015;48:507–512. [Google Scholar]
- 36.Matsumoto M., Hori J. Classification of silent speech using support vector machine and relevance vector machine. Appl. Soft Comput. 2014;20:95–102. [Google Scholar]
- 37.Lahmiri S., Shmuel A. Detection of Parkinson's disease based on voice patterns ranking and optimized support vector machine. Biomed. Signal Proces. 2019;49:427–433. [Google Scholar]
- 38.Badino L., Canevari C., Fadiga L., Metta G. Integrating articulatory data in deep neural network-based acoustic modeling. Comput. Speech Lang. 2016;36:173–195. [Google Scholar]
- 39.Chen L.L., Chen J.J. Deep neural network for automatic classification of pathological voice signals. J. Voice. 2022;36:288. doi: 10.1016/j.jvoice.2020.05.029. e15−288.e24. [DOI] [PubMed] [Google Scholar]
- 40.Hwang I., Park H.M., Chang J.H. Ensemble of deep neural networks using acoustic environment classification for statistical model-based voice activity detection. Comput. Speech Lang. 2016;38:1–12. [Google Scholar]
- 41.Shahin I., Nassif A.B., Hindawi N. Speaker identification in stressful talking environments based on convolutional neural network. Int. J. Speech Technol. 2021;24:1055–1066. [Google Scholar]
- 42.Issa D., Demirci M.F., Yazici A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Proces. 2020;59 [Google Scholar]
- 43.Farsiani S., Izadkhah H., Lotfi S. An optimum end-to-end text-independent speaker identification system using convolutional neural network. Comput. Electr. Eng. 2022;100 [Google Scholar]
- 44.Hireš M., Gazda M., Drotár P., Pah N.D., Motin M.A., Kumar D.K. Convolutional neural network ensemble for Parkinson's disease detection from voice recordings. Comput. Biol. Med. 2022;141 doi: 10.1016/j.compbiomed.2021.105021. [DOI] [PubMed] [Google Scholar]
- 45.Fernández-Díaz M., Gallardo-Antolín A. An attention Long Short-Term Memory based system for automatic classification of speech intelligibility. Eng. Appl. Artif. Intell. 2020;96 [Google Scholar]
- 46.Lindemann B., Müller T., Vietz H., Jazdi Na, Weyrich M. A survey on long short-term memory networks for time series prediction. Procedia CIRP. 2021;99:650–655. [Google Scholar]
- 47.Gallardo-Antolín A., Montero J.M. On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification. Neurocomputing. 2021;456:49–60. [Google Scholar]
- 48.Cheng Y., Leung H.C. Speaker verification using fundamental frequency. Int. Conf. Spok. Lang. Process. 1998:1–4. [Google Scholar]
- 49.Wu F., Wang X., Ye Z. The speaker and content adaptation in radiology information system. Appl. Mech. Mater. 2012;195–196:859–863. [Google Scholar]
- 50.Davis S., Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980;28:357–366. [Google Scholar]
- 51.Hermansky H. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 1990;87(4):1738–1752. doi: 10.1121/1.399423. [DOI] [PubMed] [Google Scholar]
- 52.Rakthanmanon T., Campana B., Mueen A., Batista G., Keogh E. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. ACM; 2012. Searching and mining trillions of time series subsequences under dynamic time warping; pp. 262–270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Reynolds D.A. Speaker identification and verification using Gaussian mixture speaker models. Speech Commun. 1995;17:91–108. [Google Scholar]
- 54.Rabiner R.L. A tutorial on hidden Markov models and selected applications in speech recognition, P. IEEE ASME Trans. Mechatron. 1989;77:257–286. [Google Scholar]
- 55.Muller K.R., Mika S., Ratsch G., Tsuda K., Scholkopf B. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Network. 2001;12:181–201. doi: 10.1109/72.914517. [DOI] [PubMed] [Google Scholar]
- 56.Deng L., Hinton G., Kingsbury B. IEEE Int. Conf. Acoust. Speech Signal Process. Proc. IEEE; 2013. New types of deep neural network learning for speech recognition and related applications: an overview; pp. 8599–8603. [Google Scholar]
- 57.O'Shea K., Nash R. 2015. An Introduction to Convolutional Neural Networks. arXiv preprint arXiv:1511.08458. [Google Scholar]
- 58.Mikolov T., Karafiát M., Burget L., Cernocký J., Khudanpur S. Recurrent neural network based language model. Proc. Annu. Conf. Int. Speech. Commun. Assoc. 2010:1045–1048. [Google Scholar]
- 59.Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735. [DOI] [PubMed] [Google Scholar]
- 60.Dawson L., Johnson M., Suominen H., Basilakis J., Sanchez P., Estival D., Hanlen L. A usability framework for speech recognition technologies in clinical handover: a pre-implementation study. J. Med. Syst. 2014;38:56. doi: 10.1007/s10916-014-0056-7. [DOI] [PubMed] [Google Scholar]
- 61.Sun L., Li M. Sports and health management using big data based on voice feature processing and internet of things. Sci. Program. 2021;2021 [Google Scholar]
- 62.Debruijn L.M., Verheijen E., Hasman A., Vannes F.L., Arends J. Speech interfacing for diagnosis reporting systems - an overview. Comput. Methods Progr. Biomed. 1995;48:151–156. doi: 10.1016/0169-2607(95)01676-k. [DOI] [PubMed] [Google Scholar]
- 63.Teichgraber U.K.M., Ehrenstein T., Lemke M., Stobbe H., Hosten N., Keske U., Felix R. Automatic speech recognition for report generation in computed tomography. Rofo.-Fortschr. Rontg. 1999;171:396–399. doi: 10.1055/s-1999-11094. [DOI] [PubMed] [Google Scholar]
- 64.Fager S.K. Speech recognition as a practice tool for dysarthria. Semin. Speech Lang. 2017;38:220–228. doi: 10.1055/s-0037-1602841. [DOI] [PubMed] [Google Scholar]
- 65.Landau J.A., Norwich K.H., Evans S.J. Automatic speech recognition-can it improve the man-machine interface in medical expert systems? Int. J. Bio-Med. Comput. 1989;24:111–117. doi: 10.1016/0020-7101(89)90013-5. [DOI] [PubMed] [Google Scholar]
- 66.Giorgino T., Azzini I., Rognoni C., Quaglini S., Stefanelli M., Gretter R., Falavigna D. Automated spoken dialogue system for hypertensive patient home management. Int. J. Med. Inf. 2005;74:159–167. doi: 10.1016/j.ijmedinf.2004.04.026. [DOI] [PubMed] [Google Scholar]
- 67.Shagoury J. Dr. multi-task: using speech to build up electronic medical records while caring for patients. In: Neustein A., editor. Adv. in Speech Recogn.: Mob. Environ., Call Cent. and Clinics. Springer US; Boston, MA: 2010. pp. 247–273. [Google Scholar]
- 68.Rozmus M. Transcription makeover. Virginia's Rockingham Memorial Hospital improves its clinical documentation process by implementing advanced speech-recognition technologies. Health Manag. Technol. 2010;31:20–21. [PubMed] [Google Scholar]
- 69.Nuance. Heading off the Physician Shortage: the Role Ambient Clinical Intelligence Must Play. [Online] Available: https://whatsnext.nuance.com/healthcare/the-role-ai-must-play-in-heading-off-the-physician-shortage/.
- 70.MediaInterface. Digital patient documents (Digitale Patientendokumente) (in German). [Online] Available: https://www.mediainterface.de/wie_wir_sie_unterstuetzen/digitale_patientendokumente.
- 71.Meishan Traditional Chinese Medicine Hospital launched Unisound intelligent medical voice system (in Chinese). [Online] Available: https://m.sohu.com/a/237842928_401562#read.
- 72.iFLYTEK: Medical care, education, justice, government services... Ten anti-epidemic artifacts (in Chinese). [Online] Available: http://sh.people.com.cn/big5/n2/2020/0225/c396182-33828035.html.
- 73.Kumah-Crystal Y., Pirtle C., Whyte H., Goode E., Anders S., Lehmann C. Electronic health record interactions through voice: a review. Appl. Clin. Inf. 2018;9:541–552. doi: 10.1055/s-0038-1666844. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Peivandi S., Ahmadian L., Farokhzadian J., Jahani Y. Evaluation and comparison of errors on nursing notes created by online and offline speech recognition technology and handwritten: an interventional study. BMC Med. Inform. Decis. 2022;22:96. doi: 10.1186/s12911-022-01835-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Blackley S.V., Schubert V.D., Goss F.R., Al Assad W., Garabedian P.M., Zhou L. Physician use of speech recognition versus typing in clinical documentation: a controlled observational study. Int. J. Med. Inf. 2020;141 doi: 10.1016/j.ijmedinf.2020.104178. [DOI] [PubMed] [Google Scholar]
- 76.Heinzer B. Essential elements of nursing notes and the transition to electronic health records: the migration from narrative charting will require creativity to include essential elements in EHRs. J. Healthc. Inf. Manag. 2010;24:53–59. [Google Scholar]
- 77.Groschel J., Philipp F., Skonetzki S., Genzwurker H., Wetter T., Ellinger K. Automated speech recognition for time recording in out-of-hospital emergency medicine-an experimental approach. Resuscitation. 2004;60:205–212. doi: 10.1016/j.resuscitation.2003.10.006. [DOI] [PubMed] [Google Scholar]
- 78.Hodgson T., Magrabi F., Coiera E. Efficiency and safety of speech recognition for documentation in the electronic health record. J. Am. Med. Inf. Assoc. 2017;24:1127–1133. doi: 10.1093/jamia/ocx073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Femi-Abodunde A., Olinger K., Burke L.M.B., Benefield T., Lee E.R., McGinty K., Mervak B.M. Radiology dictation errors with COVID-19 protective equipment: does wearing a surgical mask increase the dictation error rate? J. Digit. Imag. 2021;34:1294–1301. doi: 10.1007/s10278-021-00502-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Gnanamanickam J., Natarajan Y., Sri Preethaa K.R. A hybrid speech enhancement algorithm for voice assistance application. Sensors. 2021;21:7025. doi: 10.3390/s21217025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Duan T., Xu X.K., Chen S.F., Zhang Q., Chen W.Y., Lu X.M., He X.L. Application of adaptive technology-based speech recognition system in 600 cases of pathological grossing. Chin. J. Pathol. 2021;50:1034–1038. doi: 10.3760/cma.j.cn112151-20210101-00001. [DOI] [PubMed] [Google Scholar]
- 82.iFLYTEK. Dental electronic medical record. [Online] Available: https://health.xfyun.cn/solutions/eHistory.
- 83.Voll K., Atkins S., Forster B. Improving the utility of speech recognition through error detection. J. Digit. Imag. 2008;21:371–377. doi: 10.1007/s10278-007-9034-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Klann J.G., Szolovits P. An intelligent listening framework for capturing encounter notes from a doctor-patient dialog. BMC Med. Inf. Decis. Making. 2009;9:S3. doi: 10.1186/1472-6947-9-S1-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Qin B. Research on the application of intelligent speech recognition technology in medical big data fog computing system. J. Decis. Syst. 2021:1–13. [Google Scholar]
- 86.Singh M., Bharti S., Kaur H., Arora V., Saini M., Kaur M., Singh J. A facial and vocal expression based comprehensive framework for real-time student stress monitoring in an IoT-Fog-Cloud environment. IEEE Access. 2022;10:63177–63188. [Google Scholar]
- 87.Goss F.R., Zhou L., Weiner S.G. Incidence of speech recognition errors in the emergency department. Int. J. Med. Inf. 2016;93:70–73. doi: 10.1016/j.ijmedinf.2016.05.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Andriole K.P., Prevedello L.M., Dufault A., Pezeshk P., Bransfield R., Hanson R., Doubilet P.M., Seltzer S.E., Khorasani R. Augmenting the impact of technology adoption with financial incentive to improve radiology report signature times. J. Am. Coll. Radiol. 2010;7:198–204. doi: 10.1016/j.jacr.2009.11.011. [DOI] [PubMed] [Google Scholar]
- 89.Lee S.-H., Park J., Yang K., Min J., Choi J. Accuracy of cloud-based speech recognition open application programming interface for medical terms of Korean. J. Kor. Med. Sci. 2022;37:e144. doi: 10.3346/jkms.2022.37.e144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Goss F.R., Blackley S.V., Ortega C.A., Kowalski L.T., Landman A.B., Lin C.T., Meteer M., Bakes S., Gradwohl S.C., Bates D.W., et al. A clinician survey of using speech recognition for clinical documentation in the electronic health record. Int. J. Med. Inf. 2019;130 doi: 10.1016/j.ijmedinf.2019.07.017. [DOI] [PubMed] [Google Scholar]
- 91.Rodger J.A., Pendharkar P.C. A field study of the impact of gender and user's technical experience on the performance of voice-activated medical tracking application. Int. J. Hum. Comput. Stud. 2004;60:529–544. [Google Scholar]
- 92.Wang J., Jo C. Annu. Int. Conf. IEEE Eng. Med. Biol. Proc. IEEE; 2007. Vocal folds disorder detection using pattern recognition methods; pp. 3253–3256. [DOI] [PubMed] [Google Scholar]
- 93.Heris H.K., Aghazadeh B.S., Nikkhah-Bahrami M. Optimal feature selection for the assessment of vocal fold disorders. Comput. Biol. Med. 2009;39:860–868. doi: 10.1016/j.compbiomed.2009.06.014. [DOI] [PubMed] [Google Scholar]
- 94.Ali Z., Hossain M.S., Muhammad G., Sangaiah A.K. An intelligent healthcare system for detection and classification to discriminate vocal fold disorders. Future Generat. Comput. Syst. 2018;85:19–28. [Google Scholar]
- 95.Madruga M., Campos-Roca Y., Perez C.J. Multicondition training for noise-robust detection of benign vocal fold lesions from recorded speech. IEEE Access. 2020;9:1707–1722. [Google Scholar]
- 96.Verde L., De Pietro G., Alrashoud M., Ghoneim A., Al-Mutib K.N., Sannino G. Leveraging artificial intelligence to improve voice disorder identification through the use of a reliable mobile app. IEEE Access. 2019;7:124048–124054. [Google Scholar]
- 97.Saeedi N.E., Almasganj F. Wavelet adaptation for automatic voice disorders sorting. Comput. Biol. Med. 2013;43:699–704. doi: 10.1016/j.compbiomed.2013.03.006. [DOI] [PubMed] [Google Scholar]
- 98.Saeedi N.E., Almasganj F., Torabinejad F. Support vector wavelet adaptation for pathological voice assessment. Comput. Biol. Med. 2011;41:822–828. doi: 10.1016/j.compbiomed.2011.06.019. [DOI] [PubMed] [Google Scholar]
- 99.Fonseca E.S., Guido R.C., Scalassara P.R., Maciel C.D., Pereira J.C. Wavelet time-frequency analysis and least squares support vector machines for the identification of voice disorders. Comput. Biol. Med. 2007;37:571–578. doi: 10.1016/j.compbiomed.2006.08.008. [DOI] [PubMed] [Google Scholar]
- 100.Amir O., Abraham W.T., Azzam Z.S., Berger G., Anker S.D., Pinney S.P., Burkhoff D., Shallom I.D., Lotan C., Edelman E.R. Remote speech analysis in the evaluation of hospitalized patients with acute decompensated heart failure. JACC-Heart Fail. 2022;10:41–49. doi: 10.1016/j.jchf.2021.08.008. [DOI] [PubMed] [Google Scholar]
- 101.Ali Z., Muhammad G., Alhamid M.F. An automatic health monitoring system for patients suffering from voice complications in smart cities. IEEE Access. 2017;5 [Google Scholar]
- 102.Muhammad G., Alhamid M.F., Hossain M.S., Almogren A.S., Vasilakos A.V. Enhanced living by assessing voice pathology using a co-occurrence matrix. Sensors. 2017;17:267. doi: 10.3390/s17020267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Muhammad G., Rahman S.K.M.M., Alelaiwi A., Alamri A. Smart health solution integrating IoT and cloud: a case study of voice pathology monitoring. IEEE Commun. Mag. 2017;55:69–73. [Google Scholar]
- 104.Lopez-de-Ipina K., Sole-Casals J., Eguiraun H., Alonso J.B., Travieso C.M., Ezeiza A., Barroso N., Ecay-Torres M., Martinez-Lage P., Beitia B. Feature selection for spontaneous speech analysis to aid in Alzheimer's disease diagnosis: a fractal dimension approach. Comput. Speech Lang. 2015;30:43–60. [Google Scholar]
- 105.Alhussein M., Muhammad G. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access. 2018;6:41034–41041. [Google Scholar]
- 106.Alhussein M., Muhammad G. Automatic voice pathology monitoring using parallel deep models for smart healthcare. IEEE Access. 2019;7:46474–46479. [Google Scholar]
- 107.Fang S.H., Wang C.T., Chen J.Y., Tsao Y., Lin F.C. Combining acoustic signals and medical records to improve pathological voice classification. APSIPA Trans. Signal Inf. Process. 2019;8:e14. [Google Scholar]
- 108.Tuncer T., Dogan S., Ozyurt F., Belhaouari S.B., Bensmail H. Novel multi center and threshold ternary pattern based method for disease detection method using voice. IEEE Access. 2020;8:84532–84540. [Google Scholar]
- 109.Muhammad G., Alhussein M. Convergence of artificial intelligence and internet of things in smart healthcare: a case study of voice pathology detection. IEEE Access. 2021;9:89198–89209. [Google Scholar]
- 110.Lauraitis A., Maskeliunas R., Damasevicius R., Krilavicius T. A mobile application for smart computer-aided self-administered testing of cognition, speech, and motor impairment. Sensors. 2020;20:3236. doi: 10.3390/s20113236. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Mirheidari B., Blackburn D., Walker T., Reuber M., Christensen H. Dementia detection using automatic analysis of conversations. Comput. Speech Lang. 2019;53:65–79. [Google Scholar]
- 112.Becker J.T., Boller F., Lopez O.L., Saxton J., McGonigle K.L. The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Arch. Neurol. 1994;51:585–594. doi: 10.1001/archneur.1994.00540180063015. [DOI] [PubMed] [Google Scholar]
- 113.Kishimoto T., Takamiya A., Liang K.C., Funaki K., Mimura M. The project for objective measures using computational psychiatry technology (PROMPT): rationale, design, and methodology. Contemp. Clin. Trials Commun. 2020;19 doi: 10.1016/j.conctc.2020.100649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Eyben F., Wöllmer M., Schuller B. Proc. ACM MM. ACM; 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor; pp. 1459–1462. [Google Scholar]
- 115.Rodrigues Makiuchi M., Warnita T., Inoue N., Shinoda K., Yoshimura M., Kitazawa M., Funaki K., Eguchi Y., Kishimoto T. Speech paralinguistic approach for detecting dementia using gated convolutional neural network. IEICE Trans. Inf. Syst. 2021;E104D:1930–1940. [Google Scholar]
- 116.Orozco-Arroyave J., Arias-Londoño J., Vargas-Bonilla J., González-Rátiva M., Nöth E. Int. Conf. Lang. Resourc. and Eval.; 2014. New Spanish speech corpus database for the analysis of people suffering from Parkinson's disease; pp. 342–347. [Google Scholar]
- 117.Janbakhshi P., Kodrasi I., Bourlard H. Subspace-based learning for automatic dysarthric speech detection. IEEE Signal Process. Lett. 2021;28:96–100. [Google Scholar]
- 118.Svoboda E., Bořil T., Rusz J., Tykalová T., Horáková D., Guttmann C.R.G., Blagoev K.B., Hatabu H., Valtchinov V.I. Assessing clinical utility of machine learning and artificial intelligence approaches to analyze speech recordings in multiple sclerosis: a pilot study. Comput. Biol. Med. 2022;148 doi: 10.1016/j.compbiomed.2022.105853. [DOI] [PubMed] [Google Scholar]
- 119.Cai Z., Gu J., Chen H.-L. A new hybrid intelligent framework for predicting Parkinson's disease. IEEE Access. 2017;5:17188–17200. [Google Scholar]
- 120.Vaiciukynas E., Verikas A., Gelzinis A., Bacauskiene M. Detecting Parkinson's disease from sustained phonation and speech signals. PLoS One. 2017;12 doi: 10.1371/journal.pone.0185613. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121.Kuresan H., Samiappan D., Masunda S. Fusion of WPT and MFCC feature extraction in Parkinson's disease diagnosis. Technol. Health Care. 2019;27:363–372. doi: 10.3233/THC-181306. [DOI] [PubMed] [Google Scholar]
- 122.Karlsson F., Schalling E., Laakso K., Johansson K., Hartelius L. Assessment of speech impairment in patients with Parkinson's disease from acoustic quantifications of oral diadochokinetic sequences. J. Acoust. Soc. Am. 2020;147:839–851. doi: 10.1121/10.0000581. [DOI] [PubMed] [Google Scholar]
- 123.Sakar C.O., Serbes G., Gunduz A., Tunc H.C., Nizam H., Sakar B.E., Tutuncu M., Aydin T., Isenkul M.E., Apaydin H. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Appl. Soft Comput. 2019;74:255–263. [Google Scholar]
- 124.Naranjo L., Pérez C., Campos-Roca Y., Martín J. Addressing voice recording replications for Parkinson's disease detection. Expert Syst. Appl. 2016;46:286–292. [Google Scholar]
- 125.Pramanik M., Pradhan R., Nandy P., Bhoi A.K., Barsocchi P. Machine learning methods with decision forests for Parkinson's detection. Appl. Sci. 2021;11:581. [Google Scholar]
- 126.Laganas C., Iakovakis D., Hadjidimitriou S., Charisis V., Dias S.B., Bostantzopoulou S., Katsarou Z., Klingelhoefer L., Reichmann H., Trivedi D., et al. Parkinson's disease detection based on running speech data from phone calls. IEEE Trans. Biomed. Eng. 2022;69:1573–1584. doi: 10.1109/TBME.2021.3116935. [DOI] [PubMed] [Google Scholar]
- 127.Rawat A., Mishra S., Sharma Y., Khetarpal P. High accuracy multilayer autoencoder trained classification method for diagnosis of Parkinson's disease using vocal signals. J. Inf. Optim. Sci. 2022;43:93–99. [Google Scholar]
- 128.Hireš M., Gazda M., Drotár P., Pah N.D., Motin M.A., Kumar D.K. Convolutional neural network ensemble for Parkinson's disease detection from voice recordings. Comput. Biol. Med. 2022;141 doi: 10.1016/j.compbiomed.2021.105021. [DOI] [PubMed] [Google Scholar]
- 129.Shimon C., Shafat G., Dangoor I., Ben-Shitrit A. Artificial intelligence enabled preliminary diagnosis for COVID-19 from voice cues and questionnaires. J. Acoust. Soc. Am. 2021;149:1120–1124. doi: 10.1121/10.0003434. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Sonde Health. Sonde Health (SH). [Online] Available: https://www.sondehealth.com/sondeone-page.
- 131.Stasak B., Huang Z.C., Razavi S., Joachim D., Epps J. Automatic detection of COVID-19 based on short-duration acoustic smartphone speech analysis. J. Healthcare Inform. Res. 2021;5:201–217. doi: 10.1007/s41666-020-00090-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Verde L., Pietro G.D., Ghoneim A., Alrashoud M., Al-Mutib K.N., Sannino G. Exploring the use of artificial intelligence techniques to detect the presence of coronavirus covid-19 through speech and voice analysis. IEEE Access. 2021;9 doi: 10.1109/ACCESS.2021.3075571. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.Kamble M., Gonzalez-Lopez J., Grau T., Espín López J., Cascioli L., Huang Y.Q., Gomez-Alanis A., Patino J., Font R., Peinado A., et al. PANACEA cough sound-based diagnosis of COVID-19 for the DiCOVA 2021 Challenge. Proc. Annu. Conf. Int. Speech. Commun. Assoc. 2021:4271–4275. [Google Scholar]
- 134.Qian K., Schmitt M., Zheng H.Y., Koike T., Schuller B. Computer audition for fighting the SARS-CoV-2 corona crisis-introducing the multitask speech corpus for COVID-19. IEEE Internet Things J. 2021;8:16035–16046. doi: 10.1109/JIOT.2021.3067605. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 135.Villa-Parra A.C., Criollo I., Valadão C., Silva L., Coelho Y., Lampier L., Rangel L., Sharma G., Delisle-Rodríguez D., Calle-Siguencia J. Towards multimodal equipment to help in the diagnosis of COVID-19 using machine learning algorithms. Sensors. 2022;22:4341. doi: 10.3390/s22124341. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136.Brown C., Chauhan J., Grammenos A., Han J., Hasthanasombat A., Spathis D., Xia T., Cicuta P., Mascolo C. Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. ACM; 2021. Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data; pp. 3474–3484. [Google Scholar]
- 137.Son M.J., Lee S.P. COVID-19 diagnosis from crowdsourced cough sound data. Appl. Sci. 2022;12:1795. [Google Scholar]
- 138.Ponomarchuk A., Burenko I., Malkin E., Nazarov I., Kokh V., Avetisian M., Zhukov L. Project achoo: a practical model and application for COVID-19 detection from recordings of breath, voice, and cough. IEEE J. Sel. Top. Signal Process. 2022;16:175–187. doi: 10.1109/JSTSP.2022.3142514. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 139.Chowdhury N.K., Kabir M.A., Rahman M.M., Islam S.M.S. Machine learning for detecting COVID-19 from cough sounds: an ensemble-based MCDM method. Comput. Biol. Med. 2022;145 doi: 10.1016/j.compbiomed.2022.105405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140.Pahar M., Klopper M., Warren R., Niesler T. COVID-19 cough classification using machine learning and global smartphone recordings. Comput. Biol. Med. 2021;135 doi: 10.1016/j.compbiomed.2021.104572. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 141.Pahar M., Klopper M., Warren R., Niesler T. COVID-19 detection in cough, breath and speech using deep transfer learning and bottleneck features. Comput. Biol. Med. 2022;141 doi: 10.1016/j.compbiomed.2021.105153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 142.Haider N.S., Singh B.K., Periyasamy R., Behera A.K. Respiratory sound based classification of chronic obstructive pulmonary disease: a risk stratification approach in machine learning paradigm. J. Med. Syst. 2019;43:255. doi: 10.1007/s10916-019-1388-0. [DOI] [PubMed] [Google Scholar]
- 143.Chang Z., Luo P., Yang B., Zhang X. Respiratory sound recognition of chronic obstructive pulmonary disease patients based on HHT-MFCC and short-term energy. J. Comput. Appl. 2021;41:598–603. [Google Scholar]
- 144.Yilmaz E., Mitra V., Sivaraman G., Franco H. Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech. Comput. Speech Lang. 2019;58:319–334. [Google Scholar]
- 145.Mani Sekhar S.R., Kashyap G., Bhansali A., Andrew A., Singh K. Dysarthric-speech detection using transfer learning with convolutional neural networks. ICT Express. 2021;8:61–64. [Google Scholar]
- 146.Narendra N.P., Alku P. Automatic assessment of intelligibility in speakers with dysarthria from coded telephone speech using glottal features. Comput. Speech Lang. 2021;65 [Google Scholar]
- 147.Pahar M., Klopper M., Reeve B., Warren R., Theron G., Niesler T. Automatic cough classification for tuberculosis screening in a real-world environment. Physiol. Meas. 2021;42 doi: 10.1088/1361-6579/ac2fb8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148.Botha G.H.R., Theron G., Warren R.M., Klopper M., Dheda K., van Helden P.D., Niesler T.R. Detection of tuberculosis by automatic cough sound analysis. Physiol. Meas. 2018;39 doi: 10.1088/1361-6579/aab6d0. [DOI] [PubMed] [Google Scholar]
- 149.Pahar M., Klopper M., Reeve B., Warren R., Theron G., Diacon A., Niesler T. Int. Conf. Electr., Comput. Energy Technol.; 2022. Automatic Tuberculosis and COVID-19 cough classification using deep learning; pp. 1–9. [Google Scholar]
- 150.Semiz B., Hersek S., Whittingslow D.C., Ponder L.A., Prahalad S., Inan O.T. Using knee acoustical emissions for sensing joint health in patients with juvenile idiopathic arthritis: a pilot study. IEEE Sensor. J. 2018;18:9128–9136. doi: 10.1109/JSEN.2018.2869990. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151.Holmes G., Donkin A., Witten I.H. Proc. Of ANZIIS. 1994. WEKA: a machine learning workbench; pp. 357–361. [Google Scholar]
- 152.Klangpornkun N., Ruangritchai M., Munthuli A., Onsuwan C., Jaisin K., Pattanaseri K., Lortrakul J., Thanakulakkarachai P., Anansiripinyo T., Amornlaksananon A. 43rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. IEEE; 2021. Classification of depression and other psychiatric conditions using speech features extracted from a Thai psychiatric and verbal screening test; pp. 651–656. [DOI] [PubMed] [Google Scholar]
- 153.Jokić S., Cleres D., Rassouli F., Steurer-Stey C., Puhan M.A., Brutsche M., Fleisch E., Barata F. TripletCough: cougher identification and verification from contact-free smartphone-based audio recordings using metric learning. IEEE J. Biomed. Health. 2022;26:2746–2757. doi: 10.1109/JBHI.2022.3152944. [DOI] [PubMed] [Google Scholar]
- 154.Mobram S., Vali M. Depression detection based on linear and nonlinear speech features in I-vector/SVDA framework. Comput. Biol. Med. 2022;149 doi: 10.1016/j.compbiomed.2022.105926. [DOI] [PubMed] [Google Scholar]
- 155.Byeon H. Comparing ensemble-based machine learning classifiers developed for distinguishing hypokinetic dysarthria from presbyphonia. Appl. Sci. 2021;11:2235. [Google Scholar]
- 156.Titze I.R., Martin D.W. Principles of voice production. J. Acoust. Soc. Am. 1998;104:1148. [Google Scholar]
- 157.Tracey B., Patel S., Zhang Y., Chappie K., Volfson D., Parisi F., Adans-Dester C., Bertacchi F., Bonato P., Wacnik P. Voice biomarkers of recovery from acute respiratory illness. IEEE J. Biomed. Health. 2022;26:2787–2795. doi: 10.1109/JBHI.2021.3137050. [DOI] [PubMed] [Google Scholar]
- 158.Halpern B., van Son R., Brekel M., Scharenborg O. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2020. Detecting and analysing spontaneous oral cancer speech in the wild; pp. 4826–4830. [Google Scholar]
- 159.Quintas S., Mauclair J., Woisard V., Pinquier J. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2020. Automatic prediction of speech intelligibility based on X-vectors in the context of head and neck cancer; pp. 4976–4980. [Google Scholar]
- 160.Tsanas A., Little M., McSharry P., Ramig L. Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests. IEEE Trans. Biomed. Eng. 2009;57:884–893. doi: 10.1109/TBME.2009.2036000. [DOI] [PubMed] [Google Scholar]
- 161.Woldert-Jokisz B. 2007. Saarbruecken Voice Database. [Online]. Available: http://www.stimmdatenbank.coli.uni-saarland.de/help_en.php4. [Google Scholar]
- 162.Rudzicz F., Namasivayam A.K., Wolff T. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Comput. Humanit. 2012;46:523–541. [Google Scholar]
- 163.Cesari U., De Pietro G., Marciano E., Niri C., Sannino G., Verde L. A new database of healthy and pathological voices. Comput. Electr. Eng. 2018;68:310–321. [Google Scholar]
- 164.Sakar B.E., Isenkul M.M., Sakar C.O., Sertbas A. Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE J. Biomed. Health. 2013;17:828–834. doi: 10.1109/JBHI.2013.2245674. [DOI] [PubMed] [Google Scholar]
- 165.Kim H., Hasegawa-Johnson M., Perlman A., Gunderson J., Frame S. Dysarthric speech database for universal access research. Proc. Annu. Conf. Int. Speech. Commun. Assoc. 2008:1741–1744. [Google Scholar]
- 166.Sharma N., Krishnan P., Kumar R., Ramoji S., Chetupalli S.R., Ghosh P.K., Ganapathy S. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2020. Coswara--a database of breathing, cough, and voice sounds for COVID-19 diagnosis; pp. 4811–4815. [Google Scholar]
- 167.Orlandic L., Teijeiro T., Atienza D. The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Sci. Data. 2021;8:156. doi: 10.1038/s41597-021-00937-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 168.Liu Y., Lee T., Law T., Lee K.Y. Acoustical assessment of voice disorder with continuous speech using ASR posterior features. IEEE/ACM Trans. Audio Speech Lang. Process. 2019;27:1047–1059. [Google Scholar]
- 169.Barbera D.S., Huckvale M., Fleming V., Upton E., Crinion J. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2020. An utterance verification system for word naming therapy in Aphasia; pp. 706–710. [Google Scholar]
- 170.Barbera D.S., Huckvale M., Fleming V., Upton E., Coley-Fisher H., Doogan C., Shaw I., Latham W., Leff A.P., Crinion J. NUVA: a naming utterance verifier for aphasia treatment. Comput. Speech Lang. 2021;69 doi: 10.1016/j.csl.2021.101221. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 171.Jayalakshmy S., Sudha G.F. Conditional GAN based augmentation for predictive modeling of respiratory signals. Comput. Biol. Med. 2021;138 doi: 10.1016/j.compbiomed.2021.104930. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 172.Baird A., Cummins N., Schnieder S., Krajewski J., Schuller B.W. An evaluation of the effect of anxiety on speech-computational prediction of anxiety from sustained vowels. Proc. Annu. Conf. Int. Speech. Commun. Assoc. 2020:4951–4955. [Google Scholar]
- 173.Kim M.J., Kim H. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2012. Combination of multiple speech dimensions for automatic assessment of dysarthric speech intelligibility; pp. 1322–1325. [Google Scholar]
- 174.Kim M.J., Kim Y., Kim H. Automatic intelligibility assessment of dysarthric speech using phonologically-structured sparse linear model. IEEE/ACM Trans. Audio Speech Lang. Process. 2015;23:694–704. [Google Scholar]
- 175.Martínez D., Green P., Christensen H. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2013. Dysarthria intelligibility assessment in a factor analysis total variability space; pp. 2132–2136. [Google Scholar]
- 176.Kadi K.L., Selouani S.A., Boudraa B., Boudraa M. Proc. Of the World Congr. on Eng.; 2013. Discriminative prosodic features to assess the dysarthria severity levels; pp. 2201–2205. [Google Scholar]
- 177.Kim J., Kumar N., Tsiartas A., Li M., Narayanan S.S. Automatic intelligibility classification of sentence-level pathological speech. Comput. Speech Lang. 2015;29:132–144. doi: 10.1016/j.csl.2014.02.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 178.Kayasith P., Theeramunkong T., Thubthong N. Speech confusion index (O): a recognition rate indicator for dysarthric speakers. Adv. in Natural Lang. Process., Proc. 2006;4139:604–615. [Google Scholar]
- 179.Bhat C., Strik H. Automatic assessment of sentence-level dysarthria intelligibility using BLSTM. IEEE J. Sel. Top. Signal Process. 2020;14:322–330. [Google Scholar]
- 180.Dimauro G., Di Nicola V., Bevilacqua V., Caivano D., Girardi F. Assessment of speech intelligibility in Parkinson's disease using a speech-to-text system. IEEE Access. 2017;5:22199–22208. [Google Scholar]
- 181.Qin Y., Lee T., Kong A.P.H. IEEE Int. Conf. Acoust. Speech Signal Process. Proc. IEEE; 2019. Combining phone posteriorgrams from strong and weak recognizers for automatic speech assessment of people with aphasia; pp. 6420–6424. [Google Scholar]
- 182.Lee T., Liu Y., Yeung Y.T., Law T.K.T., Lee K.Y.S. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2016. Predicting severity of voice disorder from DNN-HMM acoustic posteriors; pp. 97–101. [Google Scholar]
- 183.Wahengbam K., Singh M.P., Nongmeikapam K., Singh A.D. A group decision optimization analogy-based deep learning architecture for multiclass pathology classification in a voice signal. IEEE Sensor. J. 2021;21:8100–8116. [Google Scholar]
- 184.Dhanalakshmi M., Nagarajan T., Vijayalakshmi P. Significant sensors and parameters in assessment of dysarthric speech. Sens. Rev. 2021;41:271–286. [Google Scholar]
- 185.Harimoorthy K., Thangavelu M. Cloud-assisted Parkinson disease identification system for remote patient monitoring and diagnosis in the smart healthcare applications. Concurr. Comput.-Pract. Exp. 2021;33:e6419. [Google Scholar]
- 186.Kambhampati A.B., Ramkumar B. Automatic detection and classification of systolic and diastolic profiles of PCG corrupted due to limitations of electronic stethoscope recording. IEEE Sensor. J. 2021;21:5292–5302. [Google Scholar]
- 187.Huang Y.J., Lin Y.T., Liu C.C., Lee L.E., Hung S.H., Lo J.K., Fu L.C. Assessing schizophrenia patients through linguistic and acoustic features using deep learning techniques. IEEE Trans. Neural Syst. Rehabil. Eng. 2022;30:947–956. doi: 10.1109/TNSRE.2022.3163777. [DOI] [PubMed] [Google Scholar]
- 188.Bhosale S., Tiwari U., Chakraborty R., Kopparapu S.K. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2021. Contrastive learning of cough descriptors for automatic COVID-19 preliminary diagnosis; pp. 946–950. [Google Scholar]
- 189.Casanova E., Candido A., Fernandes R.C., Finger M., Gris L.R.S., Ponti M.A., Da Silva D.P.P. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2021. Transfer learning and data augmentation techniques to the COVID-19 identification tasks in ComParE 2021; pp. 4301–4305. [Google Scholar]
- 190.Gosztolya G., Bagi A., Szalóki S., Szendi I., Hoffmann I. Proc. Annu. Conf. Int. Speech. Commun. Assoc.; 2020. Making a distinction between schizophrenia and bipolar disorder based on temporal parameters in spontaneous speech; pp. 4566–4570. [Google Scholar]
- 191.Jonell P., Moëll B., Håkansson K., Henter G.E., Beskow J. Multimodal capture of patient behaviour for improved detection of early dementia: clinical feasibility and preliminary results. Front. Comput. Sci. 2021;3 [Google Scholar]
- 192.Harar P., Galaz Z., Alonso-Hernandez J.B., Mekyska J., Burget R., Smekal Z. Towards robust voice pathology detection. Neural Comput. Appl. 2018;32:15747–15757. [Google Scholar]
- 193.Al Mamun K.A., Alhussein M., Sailunaz K., Islam M.S. Cloud based framework for Parkinson's disease diagnosis and monitoring system for remote healthcare applications. Future Generat. Comput. Syst. 2017;66:36–47. [Google Scholar]
- 194.Jiang L., Gao B., Gu J., Chen Y.P., Gao Z., Ma X.L., Kendrick K.M., Woo W.L. Wearable long-term social sensing for mental wellbeing. IEEE Sensor. J. 2019;19:8532–8542. [Google Scholar]
- 195.Nakamoto H., Katsuno Y., Yamamoto A., Umehara K., Bessho Y., Kobayashi F., Ishikawa A. Wearable band-shaped device and detection algorithm for laryngeal elevation in Mendelsohn maneuver. IEEE Sensor. J. 2021;21:14352–14359. [Google Scholar]
- 196.Trinite B. Epidemiology of voice disorders in Latvian school teachers. J. Voice. 2017;31:508.e1–508.e9. doi: 10.1016/j.jvoice.2016.10.014. [DOI] [PubMed] [Google Scholar]
- 197.Gandhi J., Gadekar A., Rajabally T., Vinayakray-Jani P., Ambawade D. Int. Conf. On Comput. Commun. Netw. Technol.; 2021. Detection of Parkinsons disease via a multi-modal approach; pp. 1–7. [Google Scholar]
- 198.Pustina D., Coslett H.B., Ungar L., Faseyitan O.K., Medaglia J.D., Avants B., Schwartz M.F. Enhanced estimations of post‐stroke aphasia severity using stacked multimodal predictions. Hum. Brain Mapp. 2017;38:5603–5615. doi: 10.1002/hbm.23752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 199.Schulte A., Suarez-Ibarrola R., Wegen D., Pohlmann P.F., Petersen E., Miernik A. Automatic speech recognition in the operating room – an essential contemporary tool or a redundant gadget? A survey evaluation among physicians in form of a qualitative study. Ann. Med. Surg. (Lond) 2020;59:81–85. doi: 10.1016/j.amsu.2020.09.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 200.Zhang H., Xiao Q., Liu X. Construction and application of intelligent mobile ward round system based on speech recognition. Chin. J. Health Inform. Manag. 2020;17:799–803. [Google Scholar]
- 201.Lin Z., Li Q., Xiang Y., Wang Q. Application and analysis of artificial intelligence voice system in postoperative follow-up of children with congenital cataract. Eye Sci. 2021;36:23–29. [Google Scholar]
- 202.Unisound. Intelligent follow-up system. [Online] Available: http://www.unisound.com/follow-up.html.
- 203.Atrash A., Kaplow R., Villemure J., West R., Yamani H., Pineau J. Development and validation of a robust speech interface for improved human-robot interaction. Int. J. Soc. Robot. 2009;1:345–356. [Google Scholar]
- 204.Almutairi K., Abdlerazek S., Elbakry H., Ebada A.I. Development of smart healthcare system for visually impaired using speech recognition. Int. J. Adv. Comput. Sci. Appl. 2020;11:647–654. [Google Scholar]
- 205.Lee S., Kang S., Han D.K., Ko H. Dialogue enabling speech-to-text user assistive agent system for hearing-impaired person. Med. Biol. Eng. Comput. 2016;54:915–926. doi: 10.1007/s11517-015-1447-8. [DOI] [PubMed] [Google Scholar]
- 206.Fontan L., Le Coz M., Azzopardi C., Stone M.A., Fuellgrabe C. Improving hearing-aid gains based on automatic speech recognition. J. Acoust. Soc. Am. 2020;148:EL227–EL233. doi: 10.1121/10.0001866. [DOI] [PubMed] [Google Scholar]
- 207.LAPUL. Preconditioning trainer (BB-RIC-D2). [Online] Available: http://www.lapul.com/Product/7516845939.html.
- 208.Valencia-Garcia R., Martinez-Bejar R., Gasparetto A. An intelligent framework for simulating robot-assisted surgical operations. Expert Syst. Appl. 2005;28:425–433. [Google Scholar]
- 209.Ren K., Wang Y., Chen X., Cai H. Speaker-dependent speech recognition algorithm for laparoscopic supporter control. Laser Optoelectron Prog. 2020;57 [Google Scholar]
- 210.Tao J.H., Yang M.H., Wang Z.L., Ban X.J., Jie L. Non-contact multi-channel natural interactive surgical environment under sterile conditions. J. Softw. 2019;30:2986–3004. [Google Scholar]
- 211.Yoo T.K., Oh E., Kim H.K., Ryu I.H., Lee I.S., Kim J.S., Kim J.K. Deep learning-based smart speaker to confirm surgical sites for cataract surgeries: a pilot study. PLoS One. 2020;15 doi: 10.1371/journal.pone.0231322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 212.Li S.X., Zhang H., Liu J.Z., Yang W.Q., Zhu K.H. Speech control system for intelligent wheelchair based on SPCE061A. Comput. Eng. 2008;34:248–250. [Google Scholar]
- 213.Al-Rousan M., Assaleh K. A wavelet- and neural network-based voice system for a smart wheelchair control. J. Franklin Inst. 2011;348:90–100. [Google Scholar]
- 214.Wang H.T., Li Y.Q., Yu T.Y. Coordinated control of an intelligent wheelchair based on a brain-computer interface and speech recognition. J. Zhejiang Univ. Sci. Comput. & Electron. 2014;15:832–838. [Google Scholar]
- 215.Punith A., Manish G., Sumanth M.S., Vinay A., Karthik R., Jyothi K. Design and implementation of a smart reader for blind and visually impaired people. AIP Conf. Proc. 2021;2317 [Google Scholar]
- 216.Jothi K.R., Mamatha V.L., Saravana B.B., Yawalkar P. Speech intelligence using machine learning for aphasia individual. Int. Conf. Comput. Intell. Knowl. Econ. 2019:664–667. [Google Scholar]
- 217.Balaji V., Sadashivappa G. Int. Conf. Signal Process. Integr. Netw.; 2019. Waveform analysis and feature extraction from speech data of dysarthric persons; pp. 955–960. [Google Scholar]
- 218.Lee S., Kang S., Ko H., Yoon J., Keum M. IEEE Int. Conf. Consum. Electron., Jan. IEEE; 2013. Dialogue enabling speech-to-text user assistive agent with auditory perceptual beamforming for hearing-impaired; pp. 360–361. [Google Scholar]
- 219.Akbarzadeh S., Lobarinas E., Kehtarnavaz N. Online personalization of compression in hearing aids via maximum likelihood inverse reinforcement learning. IEEE Access. 2022;10:58537–58546. [Google Scholar]
- 220.Gibson M., Coffin J.D.O.F.F. Recommendations for telemedicine reimbursement. J. Med. Pract. Manag. 2021;36:226–228. [Google Scholar]
- 221.Bhatt V.N. Alexa for Health Practitioners. North Dakota State Univ.; Ann Arbor: 2020. [Online] Available: https://library.ndsu.edu/ir/handle/10365/31843. [Google Scholar]
- 222.Nuance. Increased health risks and a new telehealth playing field. [Online] Available: https://whatsnext.nuance.com/healthcare/increased-health-risks-and-a-new-telehealth-playing-field/.
- 223.Liu Y., Zuo S.P., Hsu C.L. Interactive cognitive training tool designed for autism spectrum disorder children. Sensor. Mater. 2021;33:405–413. [Google Scholar]
- 224.Bu X., Ng P.H., Tong Y., Chen P.Q., Fan R.R., Tang Q.P., Cheng Q.Q., Li S.S., Cheng A.S.K., Liu X.Y. A mobile-based virtual reality speech rehabilitation App for patients with Aphasia after stroke: development and pilot usability study. JMIR Serious Games. 2022;10:e30196. doi: 10.2196/30196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 225.Pahar M., Klopper M., Reeve B., Warren R., Theron G., Diacon A., Niesler T. European Signal Process. Conf.; 2022. Wake-Cough: cough spotting and cougher identification for personalised long-term cough monitoring; pp. 185–189. [Google Scholar]
- 226.Pahar M., Miranda I., Diacon A., Niesler T. Automatic non-invasive cough detection based on accelerometer and audio signals. J. Signal Process Syst. 2022;94:821–835. doi: 10.1007/s11265-022-01748-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 227.Olami A.I. Innovation in smart healthcare: VIA's new smart voice hospital bed card. [Online] Available: http://www.cmia.info/news_detail.asp?id=14143.
- 228.Shenzhou Yunhai. Guidance robot. [Online] Available: http://www.szyh-smart.com/hangye/yiyuan/index.html.
- 229.zorarobotics. Healthcare Robots Equipped with the Zora ZBOS. [Online] Available: https://www.zorarobotics.be/use-cases.
- 230.Zhang Y., Diao Y., Liang S., Ye C., Zhou Y., Zhao G. Cognitive-motion rehabilitation medical robot application design. Inf. Control. 2021;50:740–747. 760. [Google Scholar]
- 231.Rana R., Latif S., Gururajan R., Gray A., Mackenzie G., Humphris G., Dunn J. Automated screening for distress: a perspective for the future. Eur. J. Cancer Care. 2019;28 doi: 10.1111/ecc.13033. [DOI] [PubMed] [Google Scholar]
- 232.Ismail A., Abdlerazek S., El-Henawy I.M. Development of smart healthcare system based on speech recognition using support vector machine and dynamic time warping. Sustainability. 2020;12:2403. [Google Scholar]
- 233.Pranathi B.S., Nair A., Anushree C.S., Chandar T.S. Int. Conf. Comput. Commun. Netw. Technol.; 2020. Sahayantra - a patient assistance robot; pp. 1–6. [Google Scholar]
- 234.Hao C., Fengyuan L. Design and implementation of intelligent guidance service robot. Comput. Appl. and Softw. 2020;37:329–333. [Google Scholar]
- 235.Chen Y., Zhou Z.R., Cao M., Liu M., Lin Z.H., Yang W.X., Yang X., Dhaidhai D., Xiong P. Extended Reality (XR) and telehealth interventions for children or adolescents with autism spectrum disorder: systematic review of qualitative and quantitative studies. Neurosci. Biobehav. Rev. 2022;138 doi: 10.1016/j.neubiorev.2022.104683. [DOI] [PubMed] [Google Scholar]
- 236.Albarrak A.I., Mohammed R., Almarshoud N., Almujalli L., Aljaeed R., Altuwaijiri S., Albohairy T. Assessment of physician's knowledge, perception and willingness of telemedicine in Riyadh region, Saudi Arabia. J. Infect. Public Health. 2021;14:97–102. doi: 10.1016/j.jiph.2019.04.006. [DOI] [PubMed] [Google Scholar]
- 237.Bhosale Y.H., Patnaik K.S. IoT deployable lightweight deep learning application for COVID-19 detection with lung diseases using RaspberryPi. Int. Conf. IoT Blockchain Technol. 2022:1–6. [Google Scholar]