Abstract
Robust speech source localization (SSL) is an important component of the speech processing pipeline for hearing aid devices (HADs). SSL via time difference of arrival (TDOA) estimation is known to improve the performance of HADs in noisy environments, thereby providing a better listening experience for hearing aid users. Smartphones now possess the capability to connect to HADs through wired or wireless channels. In this paper, we present our findings on a non-uniform non-linear microphone array (NUNLA) geometry for improving SSL for HADs, using the L-shaped three-element microphone array available on modern smartphones. The proposed method implements frame-based TDOA estimation using a modified dictionary-based singular value decomposition (SVD) method for localizing single speech sources under very low signal-to-noise ratios (SNRs). Unlike most methods developed for uniform microphone arrays, the proposed method exhibits low spatial aliasing as well as low spatial ambiguity while providing robust, low-error DOA estimates with 360° scanning capability. We compare different types of microphone arrays and their performance under the proposed method.
Keywords: Non-uniform microphone arrays, Speech source localization, Hearing aid devices, Smartphone, Low SNR
1 Introduction
According to a report by the World Health Organization (WHO), more than 5% of the world's population, which accounts for 360 million people, suffers from disabling hearing loss [1]. An estimated 8% of children and 17% of American adults report having some form of hearing loss, while nearly 50% of adults aged 75 years and older have hearing loss. Nevertheless, only 20% of people who could benefit from hearing aids actually use them [2]. Factors that influence whether a person chooses to wear a hearing aid include the perceived versus actual benefits, cost, stigma, value (benefit relative to price) and the inferior performance of hearing aids in noisy conditions. The performance of hearing devices, including hearing aid devices (HADs), cochlear implants (CIs), and sound amplifiers (SAs), is adversely impacted by environmental noise, which degrades the speech-processing pipeline and causes discomfort to hearing aid users.
As a result, several researchers and hearing aid manufacturers [3–6] have proposed, implemented and commercialized efficient algorithms to improve the performance of HADs in noisy environments. A majority of these algorithms are in the realm of speech enhancement, noise reduction [7, 8], acoustic feedback cancellation [9, 10], single speech source localization and microphone array beamforming [11–13]. From a physiological standpoint, these algorithms can greatly improve speech perception in noisy environments. In almost all reported research, one finds that increasing the signal-to-noise ratio (SNR) of the received noisy speech signal greatly improves noise suppression, leading to an enhanced speech signal with high perceptual quality (and low distortion).
However, it should be noted that for people with hearing loss, improving the SNR alone, while preserving the quality and intelligibility of the desired speech, might not result in 'spatially natural' sounding speech. This is because hearing loss also hampers the ability to localize or identify the direction of sound sources for many people. It was reported in [14] that hearing-impaired people do experience localization difficulties and that the degree of difficulty increases with increasing hearing loss. Moreover, the hearing-impaired groups rate their localization as poorer than the normal-hearing group, and the more severely impaired group rates its localization as poorer than the less severe group. A more detailed study on this topic is presented in [15]. Fast and reliable speech source localization (SSL) also becomes crucial for a person with hearing loss in understanding group conversations under noisy conditions. When the conversation moves from one speaker to another, they must be able to locate the new speaker instantly, or they will miss the initial part of each segment of the conversation, greatly hindering contextual understanding.
Currently, most HADs lack the computational power to handle complex signal processing algorithms, primarily due to their small size and processor and battery limitations. To assist current HADs in an efficient and cost-effective manner, an alternate approach (Fig. 1) was proposed in [16]. The idea is to use the available hardware on modern smartphones, with their powerful signal processing capabilities, and implement the signal processing algorithms on them. Smartphones are widely available and used by many people, including those with hearing problems; thus, using a smartphone poses practically no additional cost to the HAD user. The smartphone would assist the HAD by: (i) collecting data using the onboard microphone array; (ii) processing the collected data using highly optimized signal processing algorithms; and (iii) communicating/streaming the processed data to the user (on the smartphone display) or to the HAD over a wireless or wired connection. By doing so, the HAD can avoid running computationally intensive algorithms (thus saving battery power), while the smartphone can handle more complex algorithms (owing to its more powerful processor and larger battery), providing an overall synergic advantage to the user over using HADs alone. More details about the project can be found in [17].
Figure 1.
Signal Processing pipeline using Smartphone assisted HAD.
As noted before, SSL is an important signal pre-processing method that can be used to improve SNR, de-reverberation, suppression of background noise, and enhancement of speech with high perceptual quality. A smartphone and its features can be used to deploy appropriate SSL methods that improve the hearing experience of HAD users. The use of a microphone array and beamforming to find the direction of arrival (DOA) of the source signal is a popular approach to SSL. The performance of this approach depends on many factors, such as the type of noisy speech, the type and geometry of the microphone array, the number of microphones and the SNR. The focus of this paper lies in the analysis of non-uniform non-linear microphone arrays (NUNLA) (also called 'L-shaped' microphone arrays), available on smartphones, for estimating the DOA of single speech sources as an assistive tool for HADs. We present our contribution to SSL for improving the experience of hearing aid users under very noisy conditions (low SNR) using the special microphone array hardware available on popular smartphones. We propose a frame-based robust algorithm for finding the DOA of speech sources in noisy environments using the microphones of a smartphone. The estimated DOA can be communicated to the hearing aid user through visual information displayed on the smartphone screen and/or via a wired or wireless connection to the HAD. The HAD user can then align his/her position and the position of the smartphone with the source direction for optimum hearing reception.
In several noisy environments, such as group conversations, sitting around a table in restaurants, business meetings, and lecture halls, sound is often assumed to originate from only one dominant point source or talker [18]. Under this assumption, SSL algorithms can be greatly simplified, thus facilitating faster DOA estimation over smaller time frames of data. Hence, in this work, we focus our attention only on finding the DOA of a single source (called the principal source) using short frames of data (typically 20–100 ms). The frame length can be chosen based on the sound source dynamics, the desired precision and the algorithm used. Under rare situations with overlapping speech, we find the DOA of the speech source with the highest energy using sinusoidal modeling similar to that in [18].
Numerous researchers and audiologists have studied SSL using microphone arrays for hearing aids and its implications for improving speech perception. In the existing literature, uniform linear microphone arrays (ULAs) and non-uniform linear microphone arrays (NULAs) are most often studied, while NUNLAs have received limited attention. Although the analytical behavior of NUNLAs is often complex to study (due to the myriad of possible geometries), they have been reported [19–25] to offer significant benefits over ULAs and NULAs. In [19], the design and implementation of a NULA geometry using differential microphone arrays (DMA) is studied. The NULA in that work is derived from a ULA by adding an arithmetic sequence with a common difference σ (= 0.1 cm) to the inter-element spacing d (= 1 cm). A minimum-norm filter, obtained by maximizing the gain of the beamformer output, is utilized. The authors show that the NULA geometry can significantly improve the robustness of DMAs, particularly at low frequencies. A non-linear microphone array based on complementary beamforming for speech enhancement is presented in [20, 21]. The authors claim that the proposed method improves performance by lowering word error rates; however, no detailed description of the microphone array itself is provided. An L-shaped microphone array configuration is presented in [22] for impulsive acoustic source localization in a reverberant environment. The localization technique relies on a time-delay estimation technique based on the orthogonal clustering algorithm (TDE-OC), which is designed to work under room reverberation and at low sampling rates. Another three-element L-shaped geometry was proposed in [23] for sound localization based on time difference of arrival (TDOA) estimates, where the location of a sound source is determined from the intersection of hyperbolic curves produced from the TDOA estimates.
Microphone array support for ULA, NULA and NUNLA as part of a complete audio subsystem has also been provided for many PCs, laptops, and other computing devices. Microsoft published a support document [24] which outlines different array geometries (uniform and non-uniform) and compares their performance. Results are presented for up to four-element L-shaped microphone arrays, which show a higher directivity index (DI) of 10.2 dB compared to 9.9 dB for a linear microphone array with the same number of elements. In [25], Widrow provides a very detailed overview of a six-element V-shaped microphone array (another example of a NUNLA) using a telecoil to communicate wirelessly with the hearing aid. Although wearing such a microphone array as a 'necklace' (as suggested in the paper) seems bulky, the paper draws three important conclusions about the performance of the NUNLA: (i) it enhances SNR by up to 10 dB relative to omnidirectional background noise; (ii) it reduces the effect of reverberation; and (iii) it reduces the acoustic feedback between the hearing aid and the microphones (on the NUNLA) by up to 15 dB. While most SSL algorithms are based on TDOA estimation, researchers have also devoted considerable resources to methods based on steered-response power (SRP) [26–28], maximum likelihood (ML) [29], eigendecomposition (such as [30, 31]), and sparse signal recovery [32], among others. An extensive review of the most popular SSL algorithms is provided in [33, 34].
In this paper, a three-element NUNLA (in an L-shaped geometry) with closely spaced microphones (separated by unequal inter-element distances) is used to illustrate the advantages of the proposed method. We propose a singular value decomposition (SVD) based TDOA SSL algorithm to localize single speech sources under very low SNR from a single frame of audio data. Unlike most methods using ULA and NULA, the proposed method has lower spatial aliasing [35] as well as lower spatial ambiguity (explained later) while providing a robust, low-error DOA estimate with 360° scanning capability. Detailed analysis and experimental results are presented showing the performance of the proposed method under the effects of low SNRs, different data lengths, and different inter-element spacings. We also provide a useful characterization of the performance of the proposed method using the root-mean-square errors (RMSE) encountered in DOA estimation through color maps for ULA, NULA and NUNLA. Finally, the performance of the proposed method is compared with those of other popular DOA estimation algorithms using ULA and NULA.
The outline of the paper is as follows: Section 2 briefly reviews SSL and DOA estimation with respect to hearing aid applications. Some popular microphone array geometries are also addressed, along with a concise explanation of the spatial aliasing encountered in microphone arrays. Section 3 introduces our approach, along with the algorithms and the NUNLA formation available on the Nexus 6. In Section 4, experimental results are analyzed and the performance of the proposed method is compared with those of other popular methods under several conditions. Implications of these methods for real-time deployment using a smartphone are also considered. Section 5 concludes the paper.
2 Sound Source Localization (SSL) Using Microphone Arrays
Humans can perceive and locate sounds in three dimensions through their ears by exploiting multiple cues such as inter-aural time differences (ITD), inter-aural level differences (ILD) and head-related transfer functions (HRTF), among others [34]. While these spatial cues are critical for robust SSL by humans, they often cannot be processed by hearing-impaired individuals at several frequencies. In the context of microphone arrays, subtle differences in the input data obtained from each microphone provide information about inter-microphone time differences and inter-microphone level differences. These cues are efficiently exploited by DOA estimation algorithms to estimate the location (direction and distance) of the source signal. As a result, SSL using microphone arrays demands greater data handling capability and sophisticated signal processing algorithms. Unfortunately, many existing HADs cannot handle such requirements. By assisting HADs with microphone arrays for SSL, better performance can be achieved in terms of SSL and beamforming techniques. The most commonly used microphone arrays (preferably with an omnidirectional spatial response for each microphone) can be classified broadly into the following three architectures [13, 33]:
1. Uniform Linear Arrays (ULA)
In a ULA, the inter-microphone spacing d is uniform and the microphones are arranged in a straight line (linear), i.e., the ITDs are equal. Figure 2a shows a three-element ULA with uniform spacing d. For such a ULA, the ITD τ is given by (1):
τ = (d cos θ) / c    (1)
where θ is the DOA (measured from the end-fire direction) and c is the speed of sound in the medium. For air, c = 330 m/s is a fair assumption.
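As a quick numeric illustration of Eq. (1), the small helper below evaluates the ITD for a ULA; the function name and the spacing value are ours, with c = 330 m/s as assumed in the text:

```python
import numpy as np

C = 330.0  # speed of sound in air (m/s), as assumed in the text

def ula_itd(d, theta_deg, c=C):
    """Inter-microphone time difference (Eq. 1) for a ULA.

    d: inter-element spacing in metres.
    theta_deg: DOA measured from the end-fire direction, in degrees.
    Returns the delay tau in seconds.
    """
    return d * np.cos(np.radians(theta_deg)) / c

# End-fire (0 deg) gives the maximum delay d/c; broadside (90 deg) gives zero.
tau_endfire = ula_itd(0.02, 0.0)     # 0.02/330 s, about 60.6 microseconds
tau_broadside = ula_itd(0.02, 90.0)  # essentially zero
```

Note that τ depends only on cos θ, which is the root of the front-back ambiguity discussed later: θ and −θ yield identical delays for a linear array.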
Figure 2.
Microphone Arrays: (a) Uniform Linear Array (ULA), (b) Non-uniform Linear Array (NULA), and (c) Non-uniform Non-linear Array (NUNLA). d and v (≠ d) are the inter-element spacings.
2. Non-Uniform Linear Arrays (NULA)
In a NULA, the inter-microphone spacing is non-uniform, but the microphones are still arranged in a straight line. Figure 2b shows a three-element NULA with spacings d and v (d ≠ v).
3. Non Uniform Non Linear Microphone Array (NUNLA)
For a NUNLA, the ITDs between the microphones are not the same and the positioning of the microphones is not linear as in the previous cases. This architecture provides additional ITD information for more accurate SSL, which is not available from the ULA and NULA architectures. As a result, a NUNLA can handle a wider range of source frequencies than a ULA, depending on its orientation. The NUNLA also has lower spatial aliasing and little or no spatial ambiguity (as discussed later). Figure 2c shows a three-element NUNLA arranged in the 'L'-shaped geometry used in this paper.
2.1 Spatial Aliasing in Microphone Arrays
For a given source frequency, spatial aliasing occurs when the inter-element spacing d is not small enough to spatially sample the impinging sound waves (assumed to be plane waves in the far-field case). This leads to 'unwanted' peaks in the directivity pattern, which cause errors in DOA estimation [13, 35]. Mathematically, for a ULA with uniform inter-element spacing d, the condition for no aliasing can be expressed as (2):
d ≤ λmin / 2 = c / (2 fmax)    (2)
where λmin is the smallest wavelength of the plane wave, corresponding to the maximum source frequency fmax, and c is the speed of sound in air. For example, if d is fixed at 1 cm, the usable bandwidth of the source signal extends up to fmax = 16.5 kHz at c = 330 m/s. For most practical microphone arrays, the geometry and dimensions are fixed, so it is imperative to identify the usable frequency bandwidth before they can be used for reliable DOA estimation. For a ULA, the spatial condition is well defined by the relation in (2). However, for NULA and NUNLA, other methods (presented later) need to be used.
Let d = α dmax = α λmin/2. For α = 1, there is no spatial aliasing ('critical sampling'). If α > 1 ('under-sampling'), spatially aliased peaks occur in the directivity pattern. Lastly, if α < 1 ('over-sampling'), there is no spatial aliasing. These cases are analogous to the Nyquist sampling theorem and the aliasing phenomenon in the frequency domain. As the SNR of the signal arriving at the microphone array deteriorates, spatial aliasing becomes worse, leading to the detection of inaccurate peaks in the directivity pattern. A few comments need to be made here: (i) Spatial aliasing (which occurs due to inadequate sampling of plane waves) should not be confused with spatial ambiguity or 'front-back' ambiguity (Fig. 3), which occurs due to symmetry in the microphone array w.r.t. the look direction. (ii) Spatial aliasing is independent of the DOA of the source signal and depends only on the relation between d (or v) and the source frequency. Spatial ambiguity, on the other hand, depends on the spatial arrangement of the microphone array and the source location. For example, in the ULA case, DOAs of both +45° and −45° may be detected for a single source located at either +45° or −45°.
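The spatial-sampling condition (2) and the α regimes above can be checked programmatically; the sketch below uses helper names of our own choosing:

```python
import numpy as np

def max_alias_free_frequency(d, c=330.0):
    """Largest source frequency (Hz) a ULA with spacing d can sample without
    spatial aliasing, from Eq. (2): d <= lambda_min/2 = c/(2*f_max)."""
    return c / (2.0 * d)

def sampling_regime(d, f_max, c=330.0):
    """Classify d relative to d_max = c/(2*f_max), mirroring the alpha
    discussion in the text."""
    alpha = d / (c / (2.0 * f_max))
    if alpha < 1.0:
        return "over-sampling (no aliasing)"
    if alpha == 1.0:
        return "critical sampling"
    return "under-sampling (spatial aliasing)"

# d = 1 cm reproduces the 16.5 kHz usable bandwidth quoted in the text:
f_limit = max_alias_free_frequency(0.01)  # 16500.0 Hz
```

For a fixed array, one would use `max_alias_free_frequency` once to decide the band-pass range of the front-end filter, as the proposed method does in Section 3 (400–1200 Hz).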
Figure 3.
Directivity pattern showing Spatial Ambiguity in Uniform Linear Microphone Arrays (three element, d = 0.02 m) for (a) Low SNR (0 dB) and (b) High SNR (10 dB).
3 Proposed Source Localization Method
As mentioned before, we use a NUNLA consisting of the three microphones available on the Motorola Nexus 6, positioned rather close to each other as shown in Fig. 2c, to present the theoretical aspects of our SSL approach. However, this concept can easily be extended to other smartphones with three (or more) microphones as well. In this section, we present the signal model, our approach and the algorithms used for SSL.
3.1 Approach
The goal of time-delay-based SSL is to accurately estimate the angular position (azimuth and elevation) of the source(s) using a known spatial arrangement of the sensor array (microphones) by exploiting the ITD estimates. It is assumed that all microphones are identical. As mentioned before (and discussed later), the algorithm used is 'dictionary-based singular value decomposition (SVD) for principal SSL'. This algorithm is a computationally simpler version of [32] under the assumption of localizing only the principal source, and it is similar to MUSIC [30] and ESPRIT [31], which are widely used for SSL of narrowband signals. The proposed algorithm is also more robust against noise than [30] and does not assume a linear microphone array as in [30–32]. A performance comparison is presented later. The proposed algorithm is detailed in the next section and its block diagram is given in Fig. 4. The signal model is given by:
Xi(n) = s(n − Δni) + w(n)    (3)
where Xi(n) is the microphone data, n = 1, 2, …, L, for the ith microphone. s(n − Δni) is the received source signal at the ith microphone and Δni denotes the time delay at that microphone. For now, w(n) is assumed to be additive white Gaussian noise (AWGN) with a uniform spatial distribution, i.e., diffuse sensor noise. We point out that our main objective here is to present the special NUNLA architecture shown in Fig. 2c and its associated SSL method for implementation on a smartphone possessing three microphones as an assistive tool for HAD users. As such, we skip discussing the effect of room reverberation on the proposed method in this paper.
Figure 4.
Block Diagram of proposed SSL method using SVD and NUNLA. d = 0.012 m and v = 0.13 m.
Now let d and v denote the inter-microphone spacings as shown in Fig. 2b, c. The ITDs Δtij for an unknown direction θ (to be estimated) are given by:

Δt12 = (d/c) cos θ    (4)

Δt23 = (v/c) sin θ    (5)

Δt13 = (l/c) cos(θ − ϕ)    (6)

where ϕ = tan⁻¹(v/d) and l = √(v² + d²). For the Nexus 6 smartphone, d = 1.2 cm and v = 13 cm ⇒ l = 13.055 cm, ϕ = 84.72°.
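Under the L-shaped geometry of Fig. 2c, the three ITDs can be computed directly from the array dimensions. The sketch below encodes our assumed form of the delay relations (4)–(6), with one microphone at the corner, one at distance d along one edge and one at distance v along the perpendicular edge; it is an illustration, not the authors' implementation:

```python
import numpy as np

C = 330.0  # speed of sound (m/s)

def nunla_itds(theta_deg, d=0.012, v=0.13, c=C):
    """ITDs for the three-element L-shaped array (sketch of Eqs. 4-6).

    Assumed layout: mic 1 at (0, 0), mic 2 at (d, 0), mic 3 at (d, v).
    Returns (dt12, dt23, dt13) in seconds for a far-field plane wave
    arriving from azimuth theta_deg.
    """
    th = np.radians(theta_deg)
    l = np.hypot(v, d)          # diagonal spacing between mics 1 and 3
    phi = np.arctan2(v, d)      # angle of the diagonal pair
    dt12 = d * np.cos(th) / c
    dt23 = v * np.sin(th) / c
    dt13 = l * np.cos(th - phi) / c
    return dt12, dt23, dt13
```

With this geometry the diagonal delay satisfies Δt13 = Δt12 + Δt23, a useful internal consistency check, and the two perpendicular baselines resolve the cos θ ambiguity that a single linear baseline cannot.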
3.2 Algorithms
Let θ̂ be the estimated direction of the source. For each frame of data, the steps of the SSL algorithm are as follows:

1. Obtain the input microphone data Xi(n), n = 1, 2, …, L, for each microphone i and form the 3 × L data matrix X;

2. Perform SVD on X to obtain the estimated signal subspace Xs;

3. Using the frequency scanning vector fscan, compute the reference signal si(n) = exp(j2πfscann) and its subspace Ss using SVD;

4. For each frequency fscan, generate/retrieve the over-complete dictionary matrix A(i, θscan) of size 3 × 360. The dictionary matrix A holds all possible signal vectors corresponding to each value of fscan and θscan for each microphone i;

5. Scanning: for each fscan and θscan, compute the normalized correlation

NormSVD(fscan, θscan) = ‖Xs^H A(fscan, θscan) Ss‖ / (‖Xs‖ ‖A(fscan, θscan)‖ ‖Ss‖)    (7)

In (7), the size of Xs is 3 × 1, A is 3 × 1 and Ss is 1 × 1 for each iteration;

6. Find the location of the maximum of (7) to estimate θ̂:

θ̂ = arg maxθscan NormSVD(fscan, θscan)    (8)

For the far-field scenario (i.e., distance between source and microphone array > 4 m), whenever θscan approaches the true DOA, the value of NormSVD approaches unity (a unique maximum). θscan runs from θstart to θend in increments of Δθ.
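The scanning steps above can be sketched in a few lines. This is a simplified, noise-free illustration under our own assumptions: the dictionary column (steering vector) is built from assumed plane-wave delays across the L-shaped array, and the reference-signal subspace Ss is collapsed so that the score reduces to a normalized correlation with the dominant left singular vector:

```python
import numpy as np

C, FS = 330.0, 16000.0  # speed of sound (m/s) and sampling rate (Hz), assumed

def steering_vector(f, theta_deg, d=0.012, v=0.13, c=C):
    """One dictionary column A(:, theta_scan): assumed relative phase delays
    of a plane wave at frequency f across the three L-shaped microphones
    (corner microphone as reference; geometry values are illustrative)."""
    th = np.radians(theta_deg)
    l, phi = np.hypot(v, d), np.arctan2(v, d)
    delays = np.array([0.0,
                       d * np.cos(th) / c,
                       l * np.cos(th - phi) / c])
    return np.exp(-2j * np.pi * f * delays)

def norm_svd_scan(X, f_scan, theta_scan):
    """Simplified scanning loop: take the dominant left singular vector of
    the 3 x L data matrix as the signal subspace, correlate it with each
    dictionary column, and return the angle maximizing the normalized score."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    xs = U[:, 0]                      # estimated signal subspace (3 x 1)
    best_theta, best_val = None, -1.0
    for f in np.atleast_1d(f_scan):
        for th in theta_scan:
            a = steering_vector(f, th)
            val = abs(np.vdot(a, xs)) / (np.linalg.norm(a) * np.linalg.norm(xs))
            if val > best_val:
                best_val, best_theta = val, th
    return best_theta, best_val

# Simulated 570 Hz tone arriving from 48 degrees (clean, rank-one frame):
n = np.arange(800)
X = np.outer(steering_vector(570.0, 48.0), np.exp(2j * np.pi * 570.0 * n / FS))
theta_hat, score = norm_svd_scan(X, [570.0], np.arange(0, 360))
```

On this clean rank-one frame the score peaks sharply at the true angle over the full 0°–359° scan; with noisy frames, the SVD step is what provides the robustness described in the text.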
Our source signal is primarily a single speech source, which is a non-stationary wideband signal. Hence, we have modified (and simplified) [32] to handle single speech sources by converting the broadband speech into a 'dominant' narrowband sinusoid at each microphone. Speech sources under low SNR are handled using auto-regressive (AR) modeling with time-frequency (T-F) analysis, motivated by [18].
To process L samples (one frame length) of speech source data, the following procedure is used (Fig. 4):

i. Framing and windowing of the input microphone data Xi(n), n = 1, 2, …, L, for every microphone i = 1, 2, 3;

ii. Band-pass filtering of the input data between 400 and 1200 Hz (to reduce bandwidth and avoid spatial aliasing; explained later in Section 4);

iii. AR modeling to predict the sinusoidal peaks in each kth frame of noisy speech from T-F analysis. Using AR modeling of the speech data, we can partially model the dominant components using exponentials (to be used later for DOA estimation) even under very low SNRs;

iv. The location of the peak of the AR model frequency spectrum is an estimate of the dominant frequency f0 in each frame k. This frequency f0 is used to generate the scanning frequency vector fscan;

v. Scanning over fscan and θscan using (7);

vi. DOA estimation of θ̂ using (8).

To reduce the scanning complexity of the algorithm: (i) we vary fscan = f0 ± Δf Hz, with Δf = 100 Hz; and (ii) we reduce (and fix) the dimensions of Xs and Ss under the assumption of localizing only the principal source. However, these can be changed depending upon the available computational capability of the underlying platform. If a shorter L is used, steps i.–vi. can be repeated to track the source over time. For more than one source, sparsity constraints can be imposed on NormSVD to improve performance at the cost of increased computations [32].
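The AR-based dominant-frequency estimation above can be sketched as follows. This is a minimal illustration assuming a Yule-Walker AR fit, with our own choices of model order, FFT size and diagonal loading; the paper's actual AR/T-F implementation is not specified at this level of detail:

```python
import numpy as np

def ar_dominant_frequency(x, fs, order=8, nfft=4096):
    """Estimate the dominant frequency of one frame via an AR model.

    Fits AR coefficients from the Yule-Walker equations on the sample
    autocorrelation, then takes the peak of the AR spectrum
    1/|A(e^{jw})|^2 as f0. Order/nfft are illustrative choices."""
    x = np.asarray(x, float)
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-6 * r[0] * np.eye(order)      # diagonal loading for near-singular frames
    a = np.linalg.solve(R, r[1:order + 1])  # AR coefficients a_1..a_p
    w = np.fft.rfftfreq(nfft, d=1.0 / fs)
    denom = np.abs(np.fft.rfft(np.concatenate(([1.0], -a)), nfft))
    return float(w[np.argmax(1.0 / np.maximum(denom, 1e-12))])
```

In practice this would be applied to the band-pass-filtered frame, so candidate peaks are already confined to the 400–1200 Hz band used by the scanner.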
Assuming f0 and A are known, the computational complexity of the proposed SVD approach is approximated by {O(L³) + O(3L) + LθscanLfscanO(L)} ≈ O(L³), where Lθscan and Lfscan are the lengths of the vectors θscan and fscan, respectively. On a 2 GHz Intel processor, the average processing time required by the proposed algorithm for a 20 ms frame of data (L = 320 samples) was 9.4 ms (with RMSE = 0.9487°) and for a 50 ms frame of data (L = 800 samples) was 11.2 ms (with RMSE = 0.8660°). The processing time is thus much less than the frame length (without any significant loss in accuracy), which allows fast operation of the proposed algorithm.
In the next section, we present the experiments conducted and the results obtained for the proposed SSL method, and we draw important conclusions aimed at robust and fast DOA estimation. Performance comparisons with other methods using ULA and NULA are presented in the following sections.
4 Experiments and Results
In this section, several experiments are presented to highlight the advantages of the proposed SSL method for the NUNLA. Simulation experiments are presented for two parameters: (a) different SNRs; and (b) different data lengths, L. These two parameters are the most critical ones for future real-time deployment of the smartphone in a noisy environment. Lower SNR translates to difficulty in listening to the desired speech signal in the presence of background noise. Data length affects the cost of deployment: a larger data length leads to better performance at the cost of higher computational delay. Speech signals from the IEEE database are used for the experiments. f0 varies between 500 and 1200 Hz (explained later). Noise is assumed to be AWGN and the SNR is varied from −5 dB to +5 dB. The data length L is varied from 20 ms to 500 ms for a sampling rate Fs fixed at 16 kHz. This is consistent with the bandwidth (i.e. 8 kHz) available on current HADs. Higher sampling rates can also be used with the proposed method depending on the intended application.
For SSL performance evaluation, we use two performance metrics in this paper. The first metric is the directivity pattern of the NUNLA and the second is the root-mean-square error (RMSE) in the estimation of θ. The directivity pattern is a plot of NormSVD from (7), estimated using the proposed algorithm over azimuth θscan = 0° to 359° in increments of 1° for a fixed frequency f0. Directivity patterns are plotted as polar directivity plots (PDP) (θscan is wrapped) or linear directivity plots (LDP) (θscan is unwrapped). The value of θscan with the highest amplitude is the estimate θ̂. The choice of f0 also dictates the accuracy and sharpness of the directivity patterns.
RMSE is computed over N = 100 iterations, using (9):

RMSE = √[(1/N) Σk=1..N (θk − θ̂k)²]    (9)

where (θk − θ̂k) is the estimation error between the true and estimated DOA in trial k. Lower RMSE indicates better SSL.
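Eq. (9) is a standard RMSE over trials; a minimal helper (the function name is ours):

```python
import numpy as np

def doa_rmse(theta_true, theta_est):
    """Root-mean-square DOA estimation error (Eq. 9) over N trials, in degrees."""
    e = np.asarray(theta_true, float) - np.asarray(theta_est, float)
    return float(np.sqrt(np.mean(e ** 2)))
```

In a Monte-Carlo run, `theta_true` would hold the ground-truth DOA repeated for each of the N = 100 trials and `theta_est` the per-trial estimates θ̂k.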
4.1 Effect of Change in Signal-to-Noise Ratio (SNR)
Under very high background noise with uniform spatial distribution, it is very difficult for HAD users to understand speech coming from a particular direction. To simulate this condition, we vary the SNR between the source signal and the noise signal, and then estimate the direction of the sound source under different SNRs. For the sake of completeness, the SNR, defined as the ratio of signal power to noise power measured in dB, is calculated using (10):
SNR = 10 log10( Σn |s(n)|² / Σn |v(n)|² )    (10)
where s(n) is a signal sample and v(n) a noise sample, n = 1, 2, …, L. A negative SNR indicates the most hostile case, where the speech is well buried in noise. For this subsection, f0 = 570 Hz (estimated using the AR modeling described in the previous section) and L = 1600 samples, i.e., 100 ms of input microphone data. Results are presented for four values of θ taken arbitrarily, one in each quadrant, viz. 48°, 127°, 267° and 308°. We also present a detailed analysis of RMSE for all possible DOA angles (θ) in later sections.
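The SNR definition (10), together with the noise scaling needed to sweep SNR from −5 to +5 dB in such experiments, can be sketched as follows (function names are ours):

```python
import numpy as np

def snr_db(s, v):
    """SNR (Eq. 10) in dB between a signal frame s and a noise frame v."""
    return 10.0 * np.log10(np.sum(np.abs(s) ** 2) / np.sum(np.abs(v) ** 2))

def scale_noise_to_snr(s, v, target_db):
    """Scale a noise frame v so that snr_db(s, scaled_v) equals target_db."""
    g = np.sqrt(np.sum(np.abs(s) ** 2) /
                (np.sum(np.abs(v) ** 2) * 10.0 ** (target_db / 10.0)))
    return g * v

# Example: bury a 570 Hz tone in AWGN at -5 dB SNR.
rng = np.random.default_rng(1)
s = np.sin(2 * np.pi * 570.0 * np.arange(1600) / 16000.0)
noisy = s + scale_noise_to_snr(s, rng.standard_normal(1600), -5.0)
```

The scaling follows directly from (10): multiplying v by g changes the noise power by g², so the gain is chosen to make the power ratio exactly 10^(target/10).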
As shown in Fig. 5a–d, a decrease in SNR leads to a broader PDP, independent of the value of θ. While the proposed method is able to scan the complete 360° range (unlike 180° in many previous works [30–32]), there is no spatial aliasing even at 0 dB SNR. Also, while most methods break down below 0 dB SNR, the proposed method has a low average RMSE of 2.29° at −5 dB SNR (Table 1). This advantage is due to the combined use of the SVD algorithm and the NUNLA used in this paper. Table 1 also shows that increasing the SNR yields lower RMSE values. Therefore, preprocessing of the microphone data (such as noise suppression using a-priori SNR estimation [36]) can lower the RMSE of the DOA estimates and increase accuracy.
Figure 5.
Polar Directivity Patterns(PDP) for Proposed NUNLA + SVD method at (a) θ = 48° (Front-Right), (b) θ = 127° (Front-Left), (c) θ = 267° (Back-Left), and (d) θ = 308° (Back-Right) at different SNRs (L = 100 ms).
Table 1.
RMSE values under different SNR
| SNR ↓ Angle → | 48° | 127° | 267° | 308° | Average RMSE* (°) |
|---|---|---|---|---|---|
| −5 dB | 1.41 | 3.21 | 1.69 | 2.84 | 2.29 |
| 0 dB | 0.78 | 1.35 | 0.99 | 1.47 | 1.15 |
| 5 dB | 0.38 | 0.78 | 0.56 | 0.71 | 0.61 |

*Over 100 simulation trials; L = 1600 samples = 100 ms. Columns 48°–308° report RMSE* (°) at each angle.
4.2 Effect of Different Data Lengths
A longer data length, i.e., more information about the data, yields a better estimate of the signal subspace and a lower RMSE, at the cost of increased computation. For this subsection, f0 = 570 Hz (estimated using AR modeling) and the SNR of the input microphone data is 0 dB. Results are presented for four values of θ taken arbitrarily, one in each quadrant, viz. 48°, 127°, 267° and 308°.
Table 2 shows the RMSE values at the four angles for different data lengths. From Table 2, we can see that L = 500 ms yields a much lower RMSE than L = 20 ms. However, it is also interesting to note that, using the proposed method, the average RMSE for L = 20 ms at 0 dB SNR is only 2.77°, which is well within acceptable limits for audio source localization (usually about 15°). For estimating the DOA of single sources, there is a trade-off between the choice of L and the RMSE obtained: smaller L allows faster tracking of moving sound sources but leads to higher RMSE, and vice versa. The proposed method can thus be tailored to the intended DOA estimation application without much loss in accuracy.
Table 2.
RMSE values under different L
| L ↓ Angle → | 48° | 127° | 267° | 308° | Average RMSE* (°) |
|---|---|---|---|---|---|
| 500 ms | 0.28 | 0.75 | 0.48 | 0.75 | 0.57 |
| 100 ms | 0.81 | 1.46 | 0.88 | 1.47 | 1.16 |
| 20 ms | 1.68 | 3.63 | 2.18 | 3.57 | 2.77 |

*Over 100 simulation trials; SNR = 0 dB. Columns 48°–308° report RMSE* (°) at each angle.
4.3 Effect of Inter-Element Spacing (d and v) on Spatial Aliasing and Spatial Ambiguity for NUNLA
In this section, we present several experiments to illustrate the effect of d and v on DOA estimation using the directivity patterns. Recall that if dmax = λmin/2 = c/(2fmax), then d ≤ dmax implies there is no spatial aliasing at any frequency f0 ≤ fmax. In this section, we fix fmax = 1 kHz ⇒ dmax = 0.165 m from (2). Also, θ = 25° is the true DOA angle and SNR = 20 dB.
In Fig. 6, for the ULA, we plot the directivity pattern by varying d in terms of dmax, using d = αdmax with α = {0.1, 1, 10} for the sake of illustration. We observe that when d ≤ dmax (α = {0.1, 1}) and 0 ≤ θ ≤ 179°, there is no spatial aliasing. Notice also that there is no spatial ambiguity in this range. However, for d > dmax (α = 10) and 0 ≤ θ ≤ 179°, unwanted 'spatially aliased' peaks appear in the directivity pattern. Next, if we observe the directivity pattern over the range 0 ≤ θ ≤ 359°, spatial ambiguity appears as additional peaks. Thus, the ULA suffers from both spatial aliasing (when d > dmax) and spatial ambiguity (over 0 ≤ θ ≤ 359°) for fixed values of fmax, θ and SNR.
Figure 6.
Linear Directivity pattern(LDP) for ULA illustrating the effect of spatial aliasing for d = αdmax, α = {0.1,1, 10}, fmax = 1 kHz.
For NULA (Fig. 7) and NUNLA (Fig. 8), we vary the spacing v = αdmax, where α = {0.1, 1, 10}, to illustrate the effect of spatial aliasing and spatial ambiguity. For the NULA (Fig. 7), spatial aliasing is not as severe as in the ULA, but spatial ambiguity still exists when 0 ≤ θ ≤ 359°. Interestingly, for the NUNLA (Fig. 8), both spatial aliasing and spatial ambiguity are alleviated irrespective of α and θ. Thus, Fig. 8 illustrates one of the main reasons to prefer a NUNLA for DOA estimation over a ULA or NULA.
Figure 7.
Linear directivity pattern (LDP) for NULA illustrating the effect of spatial aliasing for v = αdmax, α = {0.1, 1, 10}, fmax = 1 kHz.
Figure 8.
Linear directivity pattern (LDP) for NUNLA illustrating the effect of spatial aliasing for v = αdmax, α = {0.1, 1, 10}, fmax = 1 kHz.
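The ambiguity contrast in Figs. 6, 7 and 8 follows directly from the steering vectors: a collinear array observes only cos θ, so θ and 360° - θ produce identical phase patterns, while the L-shaped NUNLA also observes sin θ. A minimal numerical check (c = 330 m/s assumed, with the 1.2 cm and 12 cm spacings used later in this section):

```python
import numpy as np

c, f0 = 330.0, 1000.0
k = 2 * np.pi * f0 / c
d, v = 0.012, 0.12                     # 1.2 cm and 12 cm spacings

def steering(pos, theta_deg):
    """Plane-wave steering vector for mic positions pos (in metres)."""
    th = np.deg2rad(theta_deg)
    return np.exp(-1j * k * (pos @ np.array([np.cos(th), np.sin(th)])))

ula   = np.array([[0.0, 0.0], [d, 0.0], [2 * d, 0.0]])   # collinear
nunla = np.array([[0.0, 0.0], [d, 0.0], [0.0, v]])       # L-shaped

theta = 25.0   # compare with its front-back mirror angle, 360 - theta
sim_ula   = abs(np.vdot(steering(ula, theta),   steering(ula, 360 - theta))) / 3
sim_nunla = abs(np.vdot(steering(nunla, theta), steering(nunla, 360 - theta))) / 3
# sim_ula is 1: the ULA cannot tell 25 deg from 335 deg.
# sim_nunla is below 1: the mirror pair is distinguishable for the NUNLA.
```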
In the rest of this section, we examine the effect of v on spatial aliasing and spatial ambiguity when v = αd but d ≠ dmax. Here, we fix d = 1.2 cm and f0 = 1000 Hz, inspired by the typical inter-element spacing on hearing aids and between a pair of the Nexus 6 microphones. The other parameters retain their previously assumed values. Notice from Fig. 9 that for both the NULA and NUNLA with v = 10d, there is no spatial aliasing. In addition, for the NUNLA there is also no spatial ambiguity at v = 10d. In conclusion, for a NUNLA with d = 1.2 cm and v = 10d = 12 cm, the effects of spatial aliasing and spatial ambiguity are largely mitigated for fixed fmax and θ. The same conclusion can be drawn when we vary f0 (f0 = 500 Hz in Fig. 10 and f0 = 2000 Hz in Fig. 11), keeping all other parameters fixed. By comparing Figs. 9, 10 and 11, we also notice that the main beam becomes narrower as frequency increases.
Figure 9.
Linear directivity pattern (LDP) for (a) NULA and (b) NUNLA illustrating the effect of spatial aliasing for v = αd, α = {0.1, 1, 10}, d = 1.2 cm, f0 = 1000 Hz.
Figure 10.
Linear directivity pattern (LDP) for (a) NULA and (b) NUNLA illustrating the effect of spatial aliasing for v = αd, α = {0.1, 1, 10}, d = 1.2 cm, f0 = 500 Hz.
Figure 11.
Linear directivity pattern (LDP) for (a) NULA and (b) NUNLA illustrating the effect of spatial aliasing for v = αd, α = {0.1, 1, 10}, d = 1.2 cm, f0 = 2000 Hz.
4.4 Effect of Orientation on Spatial Aliasing
In this section, we present our analysis of changing the orientation of the NUNLA with respect to the DOA angle of the source signal. Although the NUNLA has been shown to be more robust to spatial aliasing and spatial ambiguity than the ULA and NULA, we would like to evaluate its performance when the array orientation is changed. We therefore analyze the directivity pattern of our NUNLA under four different orientations (shown in Fig. 12).
Figure 12.
Different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4.
The first orientation, 'NUNLA 1' (Fig. 12a), is the default NUNLA and has been studied extensively in this paper. NUNLA 2 (Fig. 12b) is NUNLA 1 rotated 90° about its vertical axis. NUNLA 3 (Fig. 12c) is NUNLA 1 rotated 90° about its horizontal axis. Finally, NUNLA 4 (Fig. 12d) is NUNLA 3 rotated 90° about its vertical axis. These four orientations are motivated by the most common orientations of the smartphone. The following parameters are kept constant in this section: d = 1.2 cm, v = 12 cm and θ = 39°. The following parameters are varied: fmax = {500 Hz, 1 kHz, 2 kHz} and SNR = {0 dB, 10 dB}. The DOA angle θ is taken as the elevation angle.
Two important inferences can be drawn from the above results (Figs. 13, 14, 15, 16, 17 and 18). First, as long as f0 is below ~2 kHz, NUNLA 1 and NUNLA 2 (Figs. 13a & b, 14a & b, 15a & b and 16a & b) show little or no spatial aliasing. At 2 kHz, NUNLA 1 and NUNLA 2 (Figs. 17a & b and 18a & b) show severe spatial aliasing. Interestingly, NUNLA 3 and NUNLA 4 have lower spatial aliasing under the same conditions. The NUNLA here comprises three 'two-microphone' sub-arrays (3-choose-2). Changing the orientation of the array changes the microphone pair that samples the incident plane wave; this reduces spatial aliasing when the new orientation satisfies the condition given in (2). It shows that the effect of spatial aliasing at higher frequencies can be reduced by changing the orientation of the microphone array. The second inference is that when the SNR is increased, the spatially aliased peaks are much lower in amplitude. Hence, DOA estimation errors due to spatial aliasing can also be reduced by increasing SNR through the right orientation of the array (e.g. the orientation of the smartphone/Nexus 6 with respect to the source/speaker location) and by performing appropriate pre-filtering on the signals received at the microphones. Finally, at higher SNRs, orientation 3 has the least spatial aliasing, whereas at low SNRs, orientations 1 and 2 have lower spatial aliasing.
Figure 13.
Linear directivity pattern (LDP) for different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4 for fmax = 500 Hz and SNR = 0 dB.
Figure 14.
Linear directivity pattern (LDP) for different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4 for fmax = 500 Hz and SNR = 10 dB.
Figure 15.
Linear directivity pattern (LDP) for different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4 for fmax = 1000 Hz and SNR = 0 dB.
Figure 16.
Linear directivity pattern (LDP) for different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4 for fmax = 1000 Hz and SNR = 10 dB.
Figure 17.
Linear directivity pattern (LDP) for different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4 for fmax = 2000 Hz and SNR = 0 dB.
Figure 18.
Linear directivity pattern (LDP) for different orientations of NUNLA analyzed: (a) NUNLA 1, (b) NUNLA 2, (c) NUNLA 3, and (d) NUNLA 4 for fmax = 2000 Hz and SNR = 10 dB.
Figure 19.
Color map of RMSE versus frequency f0 and true DOA angle θ of the source for the ULA (d = 1.2 cm) at SNR = 0 dB, L = 100 ms, with 10 iterations for each combination of f0 and θ. The color legend indicates the RMSE values.
4.5 RMSE Color Maps for ULA, NULA and NUNLA
Figures 19, 20 and 21 present the RMSE for the ULA (three elements, d = 1.2 cm), NULA (three elements, d = 1.2 cm, v = 12 cm) and NUNLA (L-shaped, three elements, d = 1.2 cm, v = 12 cm) used in this paper, respectively. These figures can be interpreted as a map of all possible combinations of frequency f0 and DOA angle θ of the source that could result in DOA estimation errors. RMSE values (shown by the color legend) are plotted versus frequency f0 (x-axis) and DOA angle θ (y-axis) of the source, calculated over 10 iterations for L = 100 ms at SNR = 0 dB. The frequency f0 is varied from 500 Hz to 4 kHz in increments of 100 Hz, and the DOA θ is varied from 1° to 360° in increments of 1°. As such, each heat map plots RMSE values for about 126,000 iterations of the proposed algorithm over the possible combinations of f0 and θ. The errors (indicated by strong yellow color) can be attributed to incorrect estimation of the maxima in (8) due to spatial aliasing and to inaccuracies caused by noise. To avoid this situation, we first apply a bandpass filter between 500 and 1200 Hz to our input microphone data, Xi(n), as in Section 3. This serves two purposes: (i) the small frequency bandwidth reduces the scanning complexity for matrix A; and (ii) it eliminates any chance of spatial aliasing, as shown in Fig. 21.
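The 500–1200 Hz pre-filtering step can be sketched as follows. This is an illustrative brick-wall FFT filter, not the paper's actual filter design (which is not specified here), and the 16 kHz sampling rate is an assumption:

```python
import numpy as np

def prefilter(x, fs=16000, lo=500.0, hi=1200.0):
    """Brick-wall band-pass via the FFT (illustrative stand-in for the
    paper's pre-filter; fs = 16 kHz is an assumed sampling rate)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f > hi)] = 0.0          # keep only the 500-1200 Hz band
    return np.fft.irfft(X, len(x))

# sanity check: an in-band 800 Hz tone survives, a 200 Hz tone is removed
fs = 16000
t = np.arange(fs) / fs
kept    = prefilter(np.sin(2 * np.pi * 800 * t), fs)
removed = prefilter(np.sin(2 * np.pi * 200 * t), fs)
```

A zero-phase filter such as this one adds no extra inter-channel delay, which matters because the subsequent DOA scan relies on the relative phases across microphones.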
Figure 20.
Color map of RMSE versus frequency f0 and true DOA angle θ of the source for the NULA at SNR = 0 dB, L = 100 ms, with 10 iterations for each combination of f0 and θ. The color legend indicates the RMSE values.
Figure 21.
Color map of RMSE versus frequency f0 and true DOA angle θ of the source for the NUNLA at SNR = 0 dB, L = 100 ms, with 10 iterations for each combination of f0 and θ. The color legend indicates the RMSE values.
However, as seen in Fig. 21, the NUNLA has the least RMSE compared with the ULA (Fig. 19) and NULA (Fig. 20) around the broadside angle (around 0°) and has minimum error at the end-fire angle (around 180°). This makes the NUNLA more reliable than the ULA and NULA for DOA applications along the broadside angles. Also, the probability of committing an error over the entire spatio-spectral grid (f0 versus θ) is much lower for the NUNLA than for the ULA (as seen from the color legend). To better understand Fig. 21, we present the PDP for particular combinations of frequency f0 and DOA angle θ in six cases: in the first three cases (Fig. 22a, b, c) we have minimum RMSE; in the last three cases (Fig. 22d, e, f), we observe severe spatial aliasing, which results in maximum RMSE values (as seen in Fig. 21). These results can be extrapolated to different SNRs and data lengths based on the findings in Sections 4.1 and 4.2.
Figure 22.
Polar Directivity Plots (PDP) explaining RMSE plots for NUNLA (Fig. 21) for particular combinations of frequency f0 and DOA angle θ for six cases: (a) Min. RMSE at f0 = 1000 Hz and θ = 90°, θ̂ = 88°; (b) Min. RMSE at f0 = 2000 Hz and θ = 90°, θ̂ = 89°; (c) Min. RMSE at f0 = 3000 Hz and θ = 90°, θ̂ = 89°; (d) Max. RMSE at f0 = 2600 Hz and θ = 90°, θ̂ = 271°; (e) Max. RMSE at f0 = 1300 Hz and θ = 90°, θ̂ = 272°; (f) Max. RMSE at f0 = 1300 Hz and θ = 270°, θ̂ = 90°.
4.6 Comparison with MUSIC [30]
In this section, we compare the performance of the proposed method with the popular eigenvalue-based MUSIC method proposed in [30]. The proposed SSL method using the NUNLA shows lower spatial aliasing than MUSIC with the NULA and MUSIC with the NUNLA, as illustrated in Fig. 23. It is also observed that the PDP is much sharper for the proposed method using the NUNLA than for the other two methods. These advantages show that the NUNLA on the smartphone, combined with the proposed SSL algorithm, is superior to the previously used ULA/NULA.
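For reference, the MUSIC baseline can be sketched in a few lines. This is a minimal narrowband version under illustrative assumptions (c = 330 m/s, a unit-power 1 kHz tone at θ = 75° as in Fig. 23, 200 snapshots, complex white noise); the paper's SVD-based method is not reproduced here:

```python
import numpy as np

c, f0 = 330.0, 1000.0
k = 2 * np.pi * f0 / c
pos = np.array([[0.0, 0.0], [0.012, 0.0], [0.0, 0.12]])  # L-shaped NUNLA (m)

def steering(theta_deg):
    th = np.deg2rad(theta_deg)
    return np.exp(-1j * k * (pos @ np.array([np.cos(th), np.sin(th)])))

rng = np.random.default_rng(0)
theta_true, n_snap = 75.0, 200
a_true = steering(theta_true)
src = np.exp(1j * 2 * np.pi * rng.random(n_snap))        # random-phase tone
noise = 0.1 * (rng.standard_normal((3, n_snap))
               + 1j * rng.standard_normal((3, n_snap)))
X = np.outer(a_true, src) + noise                        # 3 x n_snap snapshots

R = X @ X.conj().T / n_snap                  # spatial covariance estimate
_, V = np.linalg.eigh(R)                     # eigenvalues in ascending order
E_noise = V[:, :2]                           # noise subspace (3 mics, 1 source)

angles = np.arange(360)                      # full 360-degree scan
P = np.array([1.0 / np.linalg.norm(E_noise.conj().T @ steering(t)) ** 2
              for t in angles])              # MUSIC pseudo-spectrum
theta_hat = int(angles[np.argmax(P)])
```

Because the L-shaped geometry senses both cos θ and sin θ, the pseudo-spectrum peaks near the true 75° rather than at its front-back mirror angle.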
Figure 23.
Polar directivity plot (PDP) comparing spatial aliasing for NULA + MUSIC (blue dash), NUNLA + MUSIC (green dot) and the proposed NUNLA + SVD method (red solid) for θ̂ = 75°, f0 = 1000 Hz, SNR = 0 dB, L = 100 ms.
5 Conclusion
In this paper, we presented an SSL algorithm that can accurately localize a single speech source using the NUNLA architecture under very low SNR conditions. The performance of the proposed method was tested for different SNRs and data lengths for speech sources to understand its implications for real-time, i.e. frame-based, deployment, especially on a smartphone platform. The proposed method is easily extendable to multiple sources for frame-based processing. The proposed algorithm using the NUNLA is shown to outperform the traditionally used ULA and NULA architectures over a complete 360° scan. This would enhance the overall SSL performance of hearing aid devices (HADs) in noisy environments, thereby providing greater comfort to HAD users.
Acknowledgments
Research supported by NIDCD of the National Institutes of Health (NIH) under award number 1R01DC015430-01. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Biographies

Anshuman Ganguly is currently pursuing his PhD degree at the Statistical Signal Processing Research Laboratory at The University of Texas at Dallas, Richardson, TX. He received his M.S. (Thesis) in Electrical Engineering at UTD in 2014. His current research interests include real-time speech source localization, microphone array beamforming, and microphone array processing. He is the recipient of the 'Best Student Paper Award' for his work on source localization using non-linear microphone arrays at IEEE SiPS 2016. He previously worked at Bose Corp. as an 'Audio Engineering Intern' in 2016 and at Amazon's Lab126 as a 'Research Scientist Intern' in 2017. He is a member of the Honor Society of Phi Kappa Phi and IEEE-UTD.

Issa Panahi (S'84–M'88–SM'07) received the Ph.D. degree in electrical engineering from the University of Colorado at Boulder in 1988. He is a professor in the department of electrical and computer engineering (ECE) and also a research professor in the department of bioengineering at the University of Texas at Dallas (UTD). He is founding director of the Statistical Signal Processing and Audio/Acoustic/Speech Research Laboratories in the ECE Department. His research interests are in audio/acoustic/speech signal processing, noise & interference cancellation, signal detection & estimation, source separation, and system identification. He joined the faculty of UTD after working in research centers and industry for 18 years. Before joining UTD in 2001, he was a DSP chief architect, chief technology officer, advanced systems development manager, and worldwide application manager in the embedded DSP systems business unit at Texas Instruments (TI) Inc. He holds a US patent and is author/co-author of 4 books at TI and over 145 published conference, journal, and technical papers, including the ETRI Best Paper of 2013. Dr. Panahi founded and was vice chair of the IEEE Dallas Chapter of EMBS. He is chair of the IEEE Dallas Chapter of SPS. He received the 2005 and 2011 "Outstanding Service Award" from the Dallas Section of IEEE. He is a senior member of IEEE. He was a member of the organizing committee and chair of the plenary sessions at IEEE ICASSP-2010. Dr. Panahi has been an organizer and chair of signal processing sessions and an associate editor of several IEEE international conferences since 2006.
References
- 1. Deafness and hearing loss, WHO fact sheet. http://www.who.int/mediacentre/factsheets/fs300/en/, updated February 2017.
- 2. Yesterday, Today & Tomorrow: NIH Research Timelines. https://report.nih.gov/NIHfactsheets/ViewFactSheet.aspx?csid=95, updated March 29, 2013.
- 3. Siemens Hearing Instruments. Product Portfolio, Spring/Summer 2015.
- 4. Starkey Hearing. Hearing solutions catalog, Winter 2015.
- 5. Phonak and Advanced Bionics, 2015.
- 6. Zounds Hearing, 2015. http://www.zoundshearing.com/corp/hearing-systems/zounds-difference/
- 7. Van den Bogaert T, et al. Speech enhancement with multichannel Wiener filter techniques in multi microphone binaural hearing aids. The Journal of the Acoustical Society of America. 2009;125(1):360–371. doi: 10.1121/1.3023069.
- 8. Reddy CKA, Hao Y, Panahi I. Two microphones spectral-coherence based speech enhancement for hearing aids using smartphone as an assistive device. 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); Orlando, FL. 2016. pp. 3670–3673.
- 9. Waterschoot TV, Moonen M. Fifty years of acoustic feedback control: State of the art and future challenges. Proceedings of the IEEE. 2011;99(2):288–327.
- 10. Khoubrouy SA, Panahi IMS, Hansen JHL. Howling detection in hearing aids based on generalized Teager–Kaiser operator. IEEE/ACM Transactions on Audio, Speech and Language Processing. 2015;23(1):154–161.
- 11. Brandstein MS. A framework for speech source localization using sensor arrays. Ph.D. Dissertation, Brown University, Providence, RI, USA; 1995.
- 12. Ganguly A, Reddy C, Hao Y, Panahi I. Improving sound localization for hearing aid devices using smartphone assisted technology. 2016 IEEE International Workshop on Signal Processing Systems (SiPS); Dallas, TX. 2016. pp. 165–170.
- 13. McCowan I. Microphone arrays: A tutorial. Queensland University, St Lucia QLD 4072, Australia; 2001. pp. 1–38.
- 14. Byrne D, Noble W. Optimizing sound localization with hearing aids. Trends in Amplification. 1998;3(2):51–73. doi: 10.1177/108471389800300202.
- 15. Noble W, Byrne D, Lepage B. Effects on sound localization of configuration and type of hearing impairment. The Journal of the Acoustical Society of America. 1994;95(2):992–1005. doi: 10.1121/1.408404.
- 16. Panahi I, Kehtarnavaz N, Thibodeau L. Smartphone-based noise adaptive speech enhancement for hearing aid applications. 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); Orlando, FL. 2016. pp. 85–88.
- 17. Smartphone-Based Open Research Platform for Hearing Improvement Studies. http://www.utdallas.edu/ssprl/hearing-aid-project/
- 18. Zhang W, Rao BD. A two microphone-based approach for source localization of multiple speech sources. IEEE Transactions on Audio, Speech and Language Processing. 2010;18(8):1913–1928.
- 19. Hao Z, Chen J, Benesty J. Study of nonuniform linear differential microphone arrays with the minimum-norm filter. Applied Acoustics. 2015;98:62–69.
- 20. Kamiyanagida H, et al. Direction of arrival estimation based on nonlinear microphone array. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'01); 2001.
- 21. Miyabe S, et al. Analytical solution of nonlinear microphone array based on complementary beamforming. 2008.
- 22. Omer M, Quadeer AA, Al-Naffouri TY, Sharawi MS. An L-shaped microphone array configuration for impulsive acoustic source localization in 2-D using orthogonal clustering based time delay estimation. 2013 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA); Sharjah. 2013. pp. 1–6.
- 23. Tellakula AK. Acoustic source localization using time delay estimation. Degree Thesis, Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, India; 2007.
- 24. Microphone Array Support in Windows. 2014. download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae.../micarrays.doc.
- 25. Widrow B, Luo FL. Microphone arrays for hearing aids: An overview. Speech Communication. 2003;39:139–146.
- 26. Dmochowski JP, Benesty J, Affes S. A generalized steered response power method for computationally viable source localization. IEEE Transactions on Audio, Speech and Language Processing. 2007;15(8):2510–2526.
- 27. Zotkin DN, Duraiswami R. Accelerated speech source localization via a hierarchical search of steered response power. IEEE Transactions on Speech and Audio Processing. 2004;12(5):499–508.
- 28. Cedervall M, Moses RL. Efficient maximum likelihood DOA estimation for signals with known waveforms in the presence of multipath. IEEE Transactions on Signal Processing. 1997;45(3):808–811.
- 29. Vorobyov SA, Gershman AB, Wong KM. Maximum likelihood direction-of-arrival estimation in unknown noise fields using sparse sensor arrays. IEEE Transactions on Signal Processing. 2005;53(1):34–43.
- 30. Tang H. DOA estimation based on MUSIC algorithm. Thesis, Linnæus University; 2014.
- 31. Roy R, Paulraj A, Kailath T. Direction-of-arrival estimation by subspace rotation methods-ESPRIT. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'86); 1986.
- 32. Malioutov D, Çetin M, Willsky AS. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Transactions on Signal Processing. 2005:3010–3022.
- 33. Tashev IJ. Sound capture and processing: practical approaches. John Wiley & Sons; 2009.
- 34. Brandstein M, Ward D, editors. Microphone arrays: signal processing techniques and applications. Springer Science & Business Media; 2013.
- 35. Dmochowski J, Benesty J, Affès S. On spatial aliasing in microphone arrays. IEEE Transactions on Signal Processing. 2009;57(4):1383–1395.
- 36. Wolfe PJ, Godsill SJ. Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement. EURASIP Journal on Applied Signal Processing. 2003;2003:1043–1051.