Abstract
Deep neural networks (DNNs) have been useful in solving benchmark problems in various domains, including audio. DNNs have been used to improve several speech processing algorithms that enhance speech perception for hearing-impaired listeners. To use DNNs to their full potential and to configure models easily, automated machine learning (AutoML) systems have been developed with a focus on model optimization. As an application of AutoML to audio and hearing aids, this work presents an AutoML-based voice activity detector (VAD) implemented on a smartphone as a real-time application. The developed VAD can be used to elevate the performance of speech processing applications, such as speech enhancement, that are widely used in hearing aid devices. The classification model generated by AutoML is computationally fast and has minimal processing delay, which enables efficient, real-time operation on a smartphone. The steps involved in the real-time implementation are discussed in detail. The key contributions of this work are the utilization of an AutoML platform for hearing aid applications and the realization of the AutoML model on a smartphone. The experimental analysis and results demonstrate the significance of using AutoML for the proposed approach. The evaluations also show improvements over state-of-the-art techniques and reflect the practical usability of the developed smartphone app in different noisy environments.
Index Terms—Automated machine learning (AutoML), voice activity detection (VAD), hearing aid devices (HADs), smartphone, real-time
I. INTRODUCTION
Applications of deep learning in the field of audio have increased significantly, owing to the availability of data and improved computing resources. Deep learning has been applied to many challenging tasks in computer vision, natural language processing [1], and several fundamental blocks of hearing aid devices (HADs), such as speaker recognition [2] and noise reduction [3], [4]. In these domains, deep neural networks (DNNs) have produced results that match or exceed the performance of state-of-the-art conventional benchmark techniques. Therefore, researchers are constantly improving deep learning methods for their specific applications.
The performance of DNNs mainly depends on the training data, the network architecture, and the choice of hyperparameters. Even though the input data or training features are selected based on their relevance to the task, there are considerable variations in performance due to the selection of the DNN architecture for each domain-specific task. Researchers in academia and industry focus on discovering specialized architectures for specific tasks. However, it is extremely difficult to design an optimized network, as there are no clear-cut guidelines or principles for finding the right combination of architecture and hyperparameters. Manual experimentation with architectures and parameters is labor intensive, tedious, and requires expertise. Therefore, instead of choosing the architecture and hyperparameters empirically, they are usually selected based on previous research or application-specific convenience. In addition, the choice of parameters varies with constraints such as latency, energy consumption, and model size. In recent years, an active field of research has developed around automated machine learning (AutoML), focusing on methods that automate different stages of the machine learning (ML) process in order to overcome some of the aforementioned problems.
AutoML methods mostly rely on neural architecture search (NAS). In NAS, combinations of different ML components from a predefined search space are used to generate robust and well-performing neural network models. Effective approaches have been proposed for selecting the model structure [5], [6] and for optimizing the hyperparameters used to design it. Various optimization methods have been proposed, including reinforcement learning (RL) approaches [5]–[7], sequential model-based optimization [8], evolutionary methods [9], [10], Bayesian optimization [11], [12], and gradient-based methods [13], [14]. Detailed surveys of AutoML techniques and comparisons of NAS algorithms can be found in [15]–[17]. AutoML has several advantages, such as automatic configuration of the model, the option of realizing simple and practical architectures, and, importantly, the possibility of applying ML to a specific task with minimal expertise. Due to these advantages, AutoML has been applied to practical problems such as medical image classification [18].
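To make the core idea of NAS concrete, the following minimal Python sketch illustrates the loop that AutoML systems automate: sample a candidate configuration from a predefined search space, train and score it, and keep the best. The search space, search budget, and the synthetic scoring function are illustrative placeholders, not the procedure used by any particular AutoML platform.

```python
import random

# Hypothetical search space: layer depth, width, kernel size, learning rate.
SEARCH_SPACE = {
    "num_layers":    [2, 4, 8, 16],
    "filters":       [16, 32, 64],
    "kernel_size":   [3, 5, 7],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def sample_architecture():
    """Randomly sample one candidate configuration from the search space."""
    return {name: random.choice(values) for name, values in SEARCH_SPACE.items()}

def train_and_score(config):
    """Placeholder for training the candidate and returning a validation score.

    A real NAS system trains (or partially trains) the candidate network here;
    a synthetic score stands in so the loop is runnable as a sketch.
    """
    return random.random()

best_config, best_score = None, float("-inf")
for trial in range(20):                      # search budget
    config = sample_architecture()
    score = train_and_score(config)
    if score > best_score:
        best_config, best_score = config, score

print("best configuration found:", best_config)
```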
As an application of AutoML to audio signal classification, we present an AutoML-based voice activity detector (VAD). To the best of our knowledge, there are no published works that apply AutoML to voice activity detection or any other audio classification task. In hearing aid (HA) systems, speech signals are amplified to compensate for the patient's hearing loss. To achieve high denoising efficiency, the signal processing pipelines in HA systems include VADs, which are used to identify the segments of the noisy speech signal that contain speech activity [19], [20]. An automated deep learning model development platform (Google Cloud AutoML Vision API, beta release) is presented with spectral images obtained from speech signals to generate a VAD classification model. The AutoML model is simple, efficient, and has a low processing delay. The developed method is implemented on a smartphone running as a real-time app. We use smartphones as our research platform because they provide efficient computational capability, with ARM multicore processors that can run complex neural network algorithms, and because they are ubiquitous, as a large portion of the population owns a smartphone [3]. In this setup, the smartphone captures the noisy speech, and the AutoML VAD classifies each input segment as noise or speech. The classification result can be seen by HA users on the screen of the smartphone. The proposed VAD can be used as a key module in HAD signal processing applications such as speaker recognition, noise reduction [21], [22], and sound source localization [23]. The significance of AutoML for audio, the real-time implementation, and the results obtained from the AutoML-based VAD are discussed.
II. PROPOSED AUTOML BASED VAD
This section discusses the training feature and the developed VAD algorithm. Figure 1 shows the block diagram of the proposed method.
Fig. 1.
Block diagram of the developed AutoML based VAD and illustration of the modules used for real-time processing.
A. Signal model and training feature
Speech processing techniques often consider an additive mixture model for the noisy speech $y(n)$, with clean speech $s(n)$ and noise $w(n)$:

$$ y(n) = s(n) + w(n) \qquad (1) $$

The noisy $k$th short-time Fourier transform (STFT) coefficient of $y(n)$ for frame $\lambda$ is given by

$$ Y_k(\lambda) = S_k(\lambda) + W_k(\lambda) \qquad (2) $$

where $S_k(\lambda)$ and $W_k(\lambda)$ are the speech and noise STFT coefficients, respectively. In polar coordinates, (2) can be written as

$$ R_k(\lambda)\,e^{j\vartheta_k(\lambda)} = A_k(\lambda)\,e^{j\alpha_k(\lambda)} + B_k(\lambda)\,e^{j\beta_k(\lambda)} \qquad (3) $$
$R_k(\lambda)$, $A_k(\lambda)$, and $B_k(\lambda)$ are the magnitude spectra of the noisy speech, clean speech, and noise, respectively, and $\vartheta_k(\lambda)$, $\alpha_k(\lambda)$, and $\beta_k(\lambda)$ are the corresponding phase spectra. As mentioned earlier, choosing relevant features is important, and a wide range of options is available for parametrically representing the speech signal. We consider the log power spectrum (LPS) of the speech signal as the fundamental feature for training the AutoML model. The choice of the LPS is based on the facts that it carries comparatively more information about the speech signal [24], is simple to generate, and does not add much delay to the input/output (i/o) latency, which is significant for real-time processing. We process only the magnitude part of the spectrum, since the phase is of minimal importance for the current application [25]. Due to symmetry, only the first half of the LPS is used as the fundamental feature for classifying a frame of the given noisy speech as noise only or speech. We consider

$$ Y_{LPS,k}(\lambda) = \log\!\left(|Y_k(\lambda)|^2\right) \qquad (4) $$

where $Y_{LPS,k}(\lambda)$ is the log power spectrum of the noisy speech at the $k$th frequency bin for the current time frame $\lambda$.
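As an illustration of the feature described above, the following Python sketch computes a 257-point LPS for one 32 ms frame at 16 kHz using a 512-point FFT (the frame and FFT sizes are those used in Section III). The Hann analysis window and the small floor added before the logarithm are assumptions for numerical stability, not details given in the text.

```python
import numpy as np

FS = 16000                      # sampling rate (Hz)
FRAME_LEN = 512                 # 32 ms frame at 16 kHz
NFFT = 512

def log_power_spectrum(frame):
    """Return the 257-point log power spectrum (first half of the FFT)."""
    windowed = frame * np.hanning(FRAME_LEN)      # analysis window (assumed)
    spectrum = np.fft.rfft(windowed, n=NFFT)      # 257 complex bins
    power = np.abs(spectrum) ** 2
    return np.log(power + 1e-12)                  # small floor avoids log(0)

# Example: LPS of one frame of synthetic noisy speech.
frame = np.random.randn(FRAME_LEN)
lps = log_power_spectrum(frame)
print(lps.shape)                                  # (257,)
```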
B. AutoML model for classification
To generate the AutoML model for the VAD, we make use of the Google Cloud AutoML Vision API [26]. The Vision API supports cloud-based training of custom models through a graphical user interface and enables supervised learning exclusively for images. Importantly, it provides an option to configure the model based on latency and model size. For the proposed method, a low-size, low-latency model is trained for deployment on smartphones.
A large dataset containing spectral images labeled as noise or speech is used to train the AutoML system. AutoML, which is based on neural architecture search (NAS) [5] and uses transfer learning [27] to accelerate the NAS process, provides the trained VAD classification model. Based on the training data, the AutoML Vision API trains and generates a model whose architecture and hyperparameters are automatically tuned. This model is then used for offline evaluation and the smartphone implementation.
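A minimal sketch of how such labeled spectral images could be prepared is given below, assuming a buffer of 5 LPS frames per image (as used in the real-time implementation of Section III). The specific colormap, normalization, and file-naming convention are assumptions for illustration only.

```python
import numpy as np
from matplotlib import cm
from PIL import Image

N_FRAMES, N_BINS = 5, 257       # 5 buffered frames x 257 LPS bins

def lps_buffer_to_rgb(lps_buffer, cmap=cm.jet):
    """Color-map a (5 x 257) LPS buffer to an RGB uint8 image.

    The colormap and min-max normalization are assumptions; the paper only
    states that the buffered LPS features are color mapped to an RGB image.
    """
    lo, hi = lps_buffer.min(), lps_buffer.max()
    normalized = (lps_buffer - lo) / (hi - lo + 1e-12)   # scale to [0, 1]
    rgba = cmap(normalized)                               # (5, 257, 4) floats
    return (rgba[..., :3] * 255).astype(np.uint8)         # drop alpha channel

# Example: save one labeled training image ("speech" or "noise").
lps_buffer = np.random.randn(N_FRAMES, N_BINS)
rgb = lps_buffer_to_rgb(lps_buffer)
Image.fromarray(rgb).save("speech_000001.png")            # label encoded in filename
```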
III. REAL-TIME IMPLEMENTATION ON SMARTPHONE
This section describes the tools and steps used for the real-time implementation on a smartphone. The proposed VAD was trained and tested using image data: the LPS of the noisy speech was extracted, and the input images were created and labeled using MATLAB. The AutoML modeling and training, which are offline procedures, were performed using the Google Cloud AutoML Vision API [26]. The trained model is exported in the TensorFlow Lite (.tflite) format, which facilitates model deployment on smartphone platforms. Xcode [28] was used for coding and debugging the VAD. For deployment on an iOS-based smartphone (e.g., iPhone), the GUI was coded in Objective-C, and input/output (i/o) handling was carried out with Core Audio [29], Apple's audio framework.
In this work, an iPhone X running iOS 12.0.1 is used for the smartphone implementation. Smartphones come with two or three microphones; however, we use the default microphone (Figure 2) on the iPhone X to capture and process the audio signal. The optimal parameters for the proposed VAD are a sampling frequency of 16 kHz with a processing frame size of 32 ms and 50% overlap. To achieve the lowest-latency audio setup on iOS smartphones, audio data samples must be read and written at a sampling rate of 48 kHz; this latency is related to the i/o of the smartphone. Therefore, for real-time processing, the input is captured at 48 kHz, giving 1536 samples for each 32 ms frame. Each frame is then downsampled to 16 kHz by low-pass filtering and decimation by a factor of 3, producing a downsampled frame of 512 samples, which is again 32 ms in duration. A 512-point FFT is computed for each frame, and, considering only the first half of the magnitude spectrum, 257 LPS features are generated. Once these features are generated, a circular buffer [3] collects the features from the 5 most recent frames for processing. The AutoML model trained using the Vision API expects the input to be a 3-dimensional RGB image. Therefore, the indexed array data in the circular buffer is color mapped to an RGB image, and OpenCV [30] wrappers are used to resize the input to the size required by the AutoML model.
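The per-frame processing chain described above can be simulated offline as in the following Python sketch (the smartphone app implements it with Core Audio and Objective-C). The Hann analysis window, the jet colormap, and the 224×224 model input size are assumptions; the paper specifies only the frame sizes, the decimation factor, the 257 LPS features, the 5-frame circular buffer, and the color-mapping and resizing steps.

```python
from collections import deque

import cv2
import numpy as np
from scipy.signal import decimate

FS_IN, FS_PROC = 48000, 16000
FRAME_IN = 1536                  # 32 ms frame at 48 kHz
DECIM = 3                        # 48 kHz -> 16 kHz
MODEL_INPUT = (224, 224)         # assumed input size expected by the exported model

buffer = deque(maxlen=5)         # circular buffer holding the 5 most recent LPS frames

def process_frame(frame_48k):
    """One iteration of the real-time pipeline, simulated offline."""
    frame_16k = decimate(frame_48k, DECIM)            # low-pass filter + downsample -> 512 samples
    spectrum = np.fft.rfft(frame_16k * np.hanning(len(frame_16k)), n=512)
    lps = np.log(np.abs(spectrum) ** 2 + 1e-12)        # 257 LPS features
    buffer.append(lps)
    if len(buffer) < buffer.maxlen:                    # wait until 5 frames are buffered
        return None
    image = np.stack(buffer)                           # (5, 257) indexed array
    image = cv2.normalize(image, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    image = cv2.applyColorMap(image, cv2.COLORMAP_JET) # map to a 3-channel image
    return cv2.resize(image, MODEL_INPUT)              # resize to the model's input size

# Example: feed synthetic 48 kHz frames through the pipeline.
for _ in range(6):
    out = process_frame(np.random.randn(FRAME_IN))
print(None if out is None else out.shape)              # (224, 224, 3)
```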
Fig. 2.
Snapshot of developed AutoML based VAD application running on a smartphone.
Figure 2 shows a snapshot of the configuration screen of the algorithm implemented on the iPhone X. When the switch is in the 'OFF' position, the application captures the audio data but performs no processing on the smartphone. Switching the button 'ON' enables the VAD module to process the incoming audio stream and display the VAD decision; the application labels the incoming input as a noise frame or a speech frame. When the switch is first turned 'ON', no processing takes place for about 2 seconds while the circular buffer is filled. Ideally, algorithms run smoothly with the lowest audio latency and without skipping frames if the processing time is less than the frame size. The overall processing time, after the initial buffer-filling period, is approximately 12 ms, which is less than the frame size. The maximum tolerable delay for HA users is around 24–30 ms [31]; the latency of the proposed method is well below this acceptable delay. The AutoML model size is around 0.5 MB, the memory consumption of the entire processing chain is 44.3 MB, the energy impact is low, and the maximum CPU usage is 9%. These results show that the proposed AutoML VAD app running on a smartphone is energy efficient. Based on our experiments, the application runs for approximately 9.5 hours on a fully charged iPhone X with a 2716 mAh battery. The AutoML-based VAD working in real time on a smartphone app is shown in a video demo [32]. The proposed method can be used as a standalone application by hearing-impaired listeners using a smartphone, or it can be integrated with applications that use the smartphone as an assistive tool for HADs [3], [23].
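For offline evaluation of the exported model, inference can be run with the TensorFlow Lite interpreter as sketched below. The model path and the label ordering of the softmax output are assumptions (exported AutoML Edge models typically ship with a separate label file).

```python
import numpy as np
import tensorflow as tf

# Hypothetical path to the exported AutoML Edge model.
interpreter = tf.lite.Interpreter(model_path="automl_vad.tflite")
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

def classify(rgb_image):
    """Run one VAD decision on a color-mapped spectral image (uint8 RGB)."""
    x = np.expand_dims(rgb_image, axis=0).astype(input_detail["dtype"])
    interpreter.set_tensor(input_detail["index"], x)
    interpreter.invoke()
    scores = interpreter.get_tensor(output_detail["index"])[0]
    return "speech" if scores[1] > scores[0] else "noise"   # assumed label order

# Example with a dummy image shaped like the model's input.
h, w = input_detail["shape"][1], input_detail["shape"][2]
dummy = np.random.randint(0, 256, size=(h, w, 3), dtype=np.uint8)
print(classify(dummy))
```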
IV. EXPERIMENTAL ANALYSIS AND RESULTS
A. Model Objective Evaluation
To train and evaluate the developed AutoML VAD, clean speech and noise were mixed at three signal-to-noise ratio (SNR) levels of 0, +5, and +10 dB to generate noisy speech. The clean speech files used for evaluation were a combination of the HINT and TIMIT corpora. Noise files from the DCASE 2017 challenge dataset [33] were used as noise data; three major types of outdoor noise, namely multi-talker babble, machinery, and traffic noise, were selected from DCASE. In addition, 10 realistic smartphone-recorded noise files for each noise type were collected and used to generate noisy speech. Audio files from the different corpora and from manual collection were used to diversify the data.
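The noisy speech can be generated by scaling the noise to a target SNR before adding it to the clean speech, as in the sketch below; the exact scaling convention used by the authors is not stated, so this is one common choice.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so the clean-to-noise power ratio equals snr_db, then add."""
    noise = noise[: len(clean)]                        # match lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: synthetic signals mixed at the three SNRs used for evaluation.
clean = np.random.randn(16000)
noise = np.random.randn(16000)
for snr in (0, 5, 10):
    noisy = mix_at_snr(clean, noise, snr)
```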
To evaluate the proposed VAD, we use the Speech Hit Rate (SHR), the fraction of actual speech frames correctly classified as speech, and the Noise Hit Rate (NHR), the fraction of noise frames correctly classified as noise. For most speech processing systems, both SHR and NHR should be high, because a low NHR or SHR means inaccurate detection of noise or speech frames, respectively. A per-frame computation of these hit rates is sketched after Table II. To the best of our knowledge, there are no published works on AutoML-based VAD; therefore, the performance of the AutoML VAD is compared with a conventional statistical model-based VAD (Sohn's VAD) [34] and with a deep neural network based VAD [20]. The experimental evaluations are performed for three different noise types (machinery, multi-talker babble, and traffic noise) at three different SNR levels. The results reported in Table I and Table II show the accuracy of the VADs in terms of SHR and NHR in various noisy environments. We can observe that the proposed AutoML-generated speech classification method outperforms the other VAD methods: the AutoML-based VAD showed an average improvement of around 5% and 22% over the DNN-based VAD and Sohn's VAD, respectively. In addition, the p-value for the proposed AutoML VAD was calculated; the accuracy results were computed for more than 30 sets of unseen data, and the p-value was ≈2.93%, indicating that the improvement is statistically significant.
TABLE I.
SHR comparison for the three VADs in different noise environments at 0 dB, +5 dB, and +10 dB SNR.

| SNR (dB) | Method | Traffic (%) | Babble (%) | Machinery (%) |
|---|---|---|---|---|
| 0 | Sohn | 71.75 | 73.75 | 72.61 |
| 0 | DNN | 83.00 | 81.65 | 79.18 |
| 0 | Proposed | 85.37 | 88.24 | 84.81 |
| 5 | Sohn | 73.65 | 79.14 | 76.15 |
| 5 | DNN | 87.20 | 84.57 | 82.66 |
| 5 | Proposed | 89.23 | 91.13 | 87.50 |
| 10 | Sohn | 76.40 | 81.65 | 78.79 |
| 10 | DNN | 88.94 | 88.63 | 86.52 |
| 10 | Proposed | 91.47 | 94.05 | 92.97 |
TABLE II.
NHR comparison for the three VADs in different noise environments at 0 dB, +5 dB, and +10 dB SNR.

| SNR (dB) | Method | Traffic (%) | Babble (%) | Machinery (%) |
|---|---|---|---|---|
| 0 | Sohn | 69.75 | 71.74 | 66.6 |
| 0 | DNN | 83.78 | 84.00 | 87.66 |
| 0 | Proposed | 89.27 | 88.14 | 92.17 |
| 5 | Sohn | 72.10 | 78.04 | 72.39 |
| 5 | DNN | 87.95 | 85.96 | 90.71 |
| 5 | Proposed | 90.42 | 89.21 | 94.21 |
| 10 | Sohn | 76.24 | 81.29 | 75.79 |
| 10 | DNN | 90.24 | 89.96 | 91.05 |
| 10 | Proposed | 93.15 | 91.07 | 97.06 |
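The SHR and NHR reported in Tables I and II can be computed from per-frame labels and VAD decisions as in the following sketch, where 1 denotes a speech frame and 0 a noise frame (this labeling convention is an assumption).

```python
import numpy as np

def hit_rates(labels, predictions):
    """Compute Speech Hit Rate and Noise Hit Rate from per-frame decisions.

    labels, predictions: sequences of 1 (speech frame) and 0 (noise frame).
    """
    labels, predictions = np.asarray(labels), np.asarray(predictions)
    speech = labels == 1
    noise = labels == 0
    shr = np.mean(predictions[speech] == 1)   # fraction of speech frames detected
    nhr = np.mean(predictions[noise] == 0)    # fraction of noise frames detected
    return shr, nhr

# Example with a short dummy frame sequence.
labels      = [1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 0, 0, 1, 1, 1, 0]
shr, nhr = hit_rates(labels, predictions)
print(f"SHR = {shr:.2%}, NHR = {nhr:.2%}")    # SHR = 75.00%, NHR = 75.00%
```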
B. Model analysis
The AutoML Vision API trains a model for the given data, with the architecture optimized and all weights tuned. We can inspect the architecture of the trained model once it is exported (.tflite) for offline evaluation. The Netron [35] software tool was used to visualize the architecture. We observed that the model is based on a convolutional neural network with a depth of 32: 2D convolutional layers are stacked, followed by one fully connected layer. There are 8 skip connections, which learn residuals. The kernel sizes range from 8 to 112, and a stride of (1, 1) is used for the convolutions. In most of the convolutional layers, the Rectified Linear Unit (ReLU) [36] is used as the activation function to generate the output of each feature map. The final fully connected layer is followed by a softmax activation function to provide the classification output. We note that, based on the data provided and the choice of inference time (the current model is chosen for the lowest latency), the Vision API creates the architecture and optimizes the hyperparameters. Even though the depth of the current model is large, the model size is only 0.5 MB. The model is optimized to have minimal latency and is implementable on smartphones in real time.
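Besides Netron, the structure of the exported model can also be inspected programmatically with the TensorFlow Lite interpreter, as in the short sketch below; the model filename is a placeholder.

```python
import tensorflow as tf

# Programmatic peek at the exported model's tensors (Netron gives a richer view).
interpreter = tf.lite.Interpreter(model_path="automl_vad.tflite")  # hypothetical path
interpreter.allocate_tensors()

for t in interpreter.get_tensor_details():
    # Tensor names typically reveal the layer types (Conv2D, ReLU, FullyConnected, Softmax).
    print(t["index"], t["name"], t["shape"], t["dtype"])
```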
V. CONCLUSIONS
In this paper, we proposed an AutoML-based VAD for hearing aid applications that runs on a smartphone in real time. We applied an AutoML platform to audio, which provides a model that is highly optimized in terms of network architecture and hyperparameters. The AutoML framework provided a low-latency, efficient architecture without much compromise in performance. The proposed VAD can be a key module for improving the speech processing pipeline in HADs. The objective tests validated the functionality of the proposed VAD and its usability in real-world conditions. We note that AutoML combined with the smartphone offers an affordable and portable platform that can be used by audio researchers, audiologists, and educators with or without domain expertise.
Acknowledgments
This work was supported by the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH) under grant number 5R01DC015430-04. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors are with the Statistical Signal Processing Research Laboratory (SSPRL), Department of Electrical and Computer Engineering, The University of Texas at Dallas.
REFERENCES
- [1]. Gardner M, et al., "AllenNLP: A deep semantic natural language processing platform," arXiv preprint arXiv:1803.07640, 2018.
- [2]. Snyder D, et al., "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
- [3]. Bhat GS, Shankar N, Reddy CKA, and Panahi IMS, "A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone," IEEE Access, vol. 7, pp. 78421–78433, 2019.
- [4]. Xu Y, Du J, Dai L, and Lee C, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, Jan. 2014.
- [5]. Zoph B and Le QV, "Neural architecture search with reinforcement learning." [Online]. Available: http://arxiv.org/abs/1611.01578
- [6]. Pham H, Guan MY, Zoph B, Le QV, and Dean J, "Efficient neural architecture search via parameter sharing," in ICML. [Online]. Available: http://arxiv.org/abs/1802.03268
- [7]. Zoph B, Vasudevan V, Shlens J, and Le QV, "Learning transferable architectures for scalable image recognition." [Online]. Available: http://arxiv.org/abs/1707.07012
- [8]. Pham H, et al., "Efficient neural architecture search via parameter sharing," arXiv preprint.
- [9]. Stanley KO and Miikkulainen R, "Evolving neural networks through augmenting topologies," vol. 10, no. 2, pp. 99–127.
- [10]. Liu H, Simonyan K, Vinyals O, Fernando C, and Kavukcuoglu K, "Hierarchical representations for efficient architecture search," in ICLR, p. 13.
- [11]. Mendoza H, Klein A, Feurer M, Springenberg JT, and Hutter F, "Towards automatically-tuned neural networks," p. 8.
- [12]. Zela A, Klein A, Falkner S, and Hutter F, "Towards automated deep learning: Efficient joint neural architecture and hyperparameter search." [Online]. Available: http://arxiv.org/abs/1807.06906
- [13]. Liu H, Simonyan K, and Yang Y, "DARTS: Differentiable architecture search." [Online]. Available: http://arxiv.org/abs/1806.09055
- [14]. Ahmed K and Torresani L, "MaskConnect: Connectivity learning by gradient descent." [Online]. Available: http://arxiv.org/abs/1807.11473
- [15]. Liang J, et al., "Evolutionary neural AutoML for deep learning," arXiv preprint arXiv:1902.06827, 2019.
- [16]. Madrid JG, et al., "Towards AutoML in the presence of drift: first results," arXiv preprint arXiv:1907.10772, 2019.
- [17]. He X, Zhao K, and Chu X, "AutoML: A survey of the state-of-the-art," arXiv preprint arXiv:1908.00709, 2019.
- [18]. Faes L, Wagner SK, Fu DJ, et al., "Automated deep learning design for medical image classification by health-care professionals with no coding experience: a feasibility study," Lancet Digital Health, vol. 1, pp. 232–242, 2019.
- [19]. Yang M, Yeh C, Zhou Y, Cerqueira JP, Lazar AA, and Seok M, "Design of an always-on deep neural network-based 1-μW voice activity detector aided with a customized software model for analog feature extraction," IEEE Journal of Solid-State Circuits, vol. 54, no. 6, pp. 1764–1777, June 2019.
- [20]. Zhang X-L and Wu J, "Denoising deep neural networks based voice activity detection," in Proc. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
- [21]. Bhat GS, Reddy CK, Shankar N, and Panahi IM, "Smartphone based real-time super Gaussian single microphone speech enhancement to improve intelligibility for hearing aid users using formant information," in Proc. 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), July 2018, pp. 5503–5506.
- [22]. Reddy CKA, Shankar N, Bhat GS, Charan R, and Panahi I, "An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device," IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1601–1605, 2017.
- [23]. Küçük A, Ganguly A, Hao Y, and Panahi IMS, "Real-time convolutional neural network-based speech source localization on smartphone," IEEE Access, vol. 7, pp. 169969–169978, 2019.
- [24]. Wang Y, Narayanan A, and Wang D, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
- [25]. Wang D and Lim J, "The unimportance of phase in speech enhancement," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679–681, 1982.
- [26]. Google (2019). https://cloud.google.com/vision/automl/docs/
- [27]. Wong C, Houlsby N, Lu Y, and Gesmundo A, "Transfer learning with neural AutoML," in Advances in Neural Information Processing Systems, 2018, pp. 8356–8365.
- [28]. Apple (2019). https://developer.apple.com/xcode/
- [29]. Apple (2019). https://developer.apple.com/library/archive/documentation/MusicAudio/Conceptual/CoreAudioOverview/WhatisCoreAudio/WhatisCoreAudio.html
- [30]. OpenCV (2019). https://docs.opencv.org/master/d3/def/tutorial_image_manipulation.html
- [31]. Alexander J, "Hearing aid delay and current drain in modern digital devices," Canadian Audiologist, vol. 3, no. 4, 2016.
- [32]. SSPRL (2019). https://www.utdallas.edu/ssprl/hearing-aid-project/video-demonstration/
- [33]. Mesaros A, Heittola T, and Virtanen T, "TUT Acoustic Scenes 2017, Development Dataset," Jan. 1, 2017. [Online]. Available: https://zenodo.org/record/400515
- [34]. Sohn J, Kim NS, and Sung W, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, Jan. 1999.
- [35]. Netron (2019). https://electronjs.org/apps/netron
- [36]. Maas AL, Hannun AY, and Ng AY, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, 2013, vol. 30.