Author manuscript; available in PMC: 2021 Mar 18.
Published in final edited form as: IEEE/ACM Trans Audio Speech Lang Process. 2019 Aug 12;27(11):1839–1848. doi: 10.1109/taslp.2019.2934319

Deep Learning for Talker-dependent Reverberant Speaker Separation: An Empirical Study

Masood Delfarah 1, DeLiang Wang 1
PMCID: PMC7970708  NIHMSID: NIHMS1538979  PMID: 33748321

Abstract

Speaker separation refers to the problem of separating speech signals from a mixture of simultaneous speakers. Previous studies are limited to addressing the speaker separation problem in anechoic conditions. This paper addresses the problem of talker-dependent speaker separation in reverberant conditions, which are characteristic of real-world environments. We employ recurrent neural networks with bidirectional long short-term memory (BLSTM) to separate and dereverberate the target speech signal. We propose two-stage networks to effectively deal with both speaker separation and speech dereverberation. In the two-stage model, the first stage separates and dereverberates two-talker mixtures and the second stage further enhances the separated target signal. We have extensively evaluated the two-stage architecture, and our empirical results demonstrate large improvements over unprocessed mixtures and clear performance gain over single-stage networks in a wide range of target-to-interferer ratios and reverberation times in simulated as well as recorded rooms. Moreover, we show that time-frequency masking yields better performance than spectral mapping for reverberant speaker separation.

Index Terms: Cochannel speech separation, two-stage network, deep neural networks, speech dereverberation

I. Introduction

Sounds recorded in real acoustic scenes are usually distorted by room reverberation. These distortions, which result from sound reflections off surrounding walls and objects, pose a challenge to human listeners and speech processing systems alike. A more severe kind of distortion occurs when the target speech signal is also corrupted by the presence of other sound sources. Perceptual studies on speech intelligibility report that human listeners, particularly those with hearing impairment, have trouble understanding speech in noisy and reverberant conditions [3], [8], [13].

Monaural speech separation aims at separating a target speech signal from a single-microphone recording that contains additive and convolutive interference. Due to its wide applicability, monaural speech separation has been studied for decades. Traditional methods include speech enhancement [24], such as spectral subtraction, and computational auditory scene analysis [32], such as pitch-based separation of voiced speech. In recent years, supervised learning techniques, particularly deep learning algorithms, have elevated speech separation performance by large margins [33]. In these studies, deep neural networks (DNNs) are typically used to learn a mapping from a mixture signal to the clean signal or its ideal time-frequency (T-F) mask. For instance, the first such study by Wang and Wang [36] used a deep feedforward network (DFN) to estimate the ideal binary mask for speech separation. Subsequent studies demonstrated that DNN based monaural separation improves human speech intelligibility in noisy environments [12], [9].

One kind of speech separation is speaker separation, where the interference is one or multiple competing talkers. Deep learning methods have also been employed to address the speaker separation problem. Previous studies [6], [16], [17], [41] trained DNN models to separate two-talker mixtures in anechoic environments. Recently, we showed that a DNN produces significant speech intelligibility benefits for human listeners [11]. These studies can be categorized as talker-dependent speaker separation, as the speakers to be separated are the same as those used in training. Other kinds of speaker separation are target-dependent and talker-independent [33]. In target-dependent speaker separation [6], [41], the target speaker is assumed to be known and used during training, while interfering speakers can be unknown and untrained. In talker-independent separation, test speakers can all be untrained. Significant advances have been achieved recently on this task [14], [23], [25], [30], [37]. Although talker-independent speaker separation is the least constrained in terms of applicability, there are application scenarios where talker-dependent or target-dependent separation is a natural choice. One such scenario is when speaker separation is applied to a small number of registered speakers, as in the case of Alibaba's Tmall Genie, an Echo-like voice assistant that features speaker recognition. Our study focuses on talker-dependent speaker separation. We will also compare with target-dependent and talker-independent models, demonstrating that broader speaker separation may come at the expense of performance loss.

Although DNNs have been used to enhance noisy and reverberant speech [42], no previous study, to our knowledge, has addressed the reverberant speaker separation problem in monaural recordings except for [38] where a single-channel scenario is evaluated as a baseline for multi-channel talker-independent speaker separation. In this paper, we investigate this problem by using recurrent neural networks (RNNs) with BLSTM [15]. As room reverberation exhibits strong temporal structure, RNNs should be more suited than DFNs for speech dereverberation. This is indeed what is found in our study. Motivated by a recent two-stage model for speech dereverberation and denoising [42], we propose two-stage networks to tackle the challenge of reverberant speaker separation. We find that two-stage networks outperform single-stage DNNs. In addition, our empirical investigation shows that T-F masking yields better results than spectral mapping. Other studies have also used two-stage networks for the separation problem. The study in [10] addresses speech-music separation where the first stage separates speech and music, and the second stage enhances each of the sources. Another study [34] performs speaker separation using a gender-mixture detection network followed by a separation network. Unlike [42], reverberation is not considered in these studies.

A preliminary version of this paper has appeared in [5]. Compared to the previous conference version, this paper introduces the second stage DNN for further enhancement of the separated target signal. In addition, more comprehensive experiments are conducted and new comparisons are made with speaker-independent and target-dependent methods.

The rest of the paper is organized as follows. In Section II we describe the baseline single-stage model and the proposed two-stage system. Section III presents experimental results. We conclude the paper in Section IV.

II. Proposed method

Let us define the anechoic target speech signal as s1(t) and the anechoic interfering speech signal as s2(t). We assume that s1(t) and s2(t) are convolved with different room impulse responses (RIRs) h1(t) and h2(t), respectively. Then, the reverberant mixture signal y(t) can be described as:

$y(t) = h_1(t) * s_1(t) + h_2(t) * s_2(t)$ (1)

where symbol * denotes convolution.

We study different DNN architectures to separate the direct sound s1(t) from y(t). The goal of our separation is to improve the speech intelligibility and quality for human listeners. In other words, we intend to separate the target speaker from the interferer and, at the same time, dereverberate the target utterance since interfering speech and room reverberation both adversely affect speech perception [3], [28].

A. Feature extraction

The mixture signal y(t) is sampled at 16 kHz and windowed into 20-ms frames with 10-ms frame shift. In each time frame we extract 31-dimensional (31-D) power-normalized cepstral coefficients (PNCC) [21], 31-D gammatone frequency cepstral coefficients (GFCC) [29], and 40-D log-mel filterbank (LOG-MEL) features. These feature choices are made on the basis of our recent feature study for reverberant speech separation [4], where a detailed description for each of these features can be found. This feature study concludes that PNCC, GFCC, and LOG-MEL form a complementary feature set.

Let F(m) represent the 102-D input feature vector, where m is the time frame index. From the entire training set, the mean (μF) and standard deviation (σF) are calculated in each feature dimension. Then, the zero-mean and unit-variance normalized feature vector, $\bar{F}(m)$, is obtained as follows:

$\bar{F}(m) = \dfrac{F(m) - \mu_F}{\sigma_F}$ (2)

The same μF and σF values are used for feature normalization during the cross-validation and the test phase. To encode temporal information in the input signal we concatenate frames to form the following feature vector:

$\bar{F}_{a,b}(m) = \left[\bar{F}(m-a), \ldots, \bar{F}(m), \ldots, \bar{F}(m+b)\right]$ (3)

where a and b denote the numbers of past and future frames with respect to the current frame.
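For concreteness, the following NumPy sketch shows one way to implement the normalization of Eq. (2) and the frame splicing of Eq. (3). The edge padding at utterance boundaries and the function names are our illustrative assumptions, not part of the original system.

```python
import numpy as np

def normalize_features(train_feats, feats):
    """Zero-mean, unit-variance normalization per feature dimension (Eq. 2).

    train_feats: (num_frames, 102) matrix pooled over the training set;
    the same statistics are reused for validation and test features.
    """
    mu = train_feats.mean(axis=0)
    sigma = train_feats.std(axis=0) + 1e-8  # guard against zero variance (our addition)
    return (feats - mu) / sigma

def splice_frames(feats, a, b):
    """Concatenate a past and b future frames around each frame (Eq. 3)."""
    num_frames, _ = feats.shape
    # Assumed boundary handling: repeat the edge frames.
    padded = np.pad(feats, ((a, b), (0, 0)), mode="edge")
    return np.stack(
        [padded[m:m + a + b + 1].reshape(-1) for m in range(num_frames)]
    )
```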

B. Training targets

Applying the short-time Fourier transform (STFT) to s1(t), s2(t), and y(t) yields the complex STFT representations S1, S2, and Y, respectively. In this study, we aim at obtaining the magnitude spectrogram of the anechoic target signal |S1|. The estimated magnitude spectrogram, together with the mixture phase, then produces the separated target signal using the overlap-add method [1].

Our study considers two different training targets. The first is simply the log-magnitude spectrograms of the two sources [log|S1|, log|S2|]. Such a training target is commonly known as mapping-based [6]. An alternative target is the ideal ratio mask (IRM) [35]:

$\mathrm{IRM} = \left[\dfrac{|S_1|}{|S_1| + |Y - S_1|},\ \dfrac{|S_2|}{|S_2| + |Y - S_2|}\right]$ (4)

In this case, a DNN generates an estimated ratio mask, and the separated magnitude spectrograms, |Ŝ1| and |Ŝ2|, are obtained by point-wise multiplication of the mixture magnitude spectrogram and each of the estimated ratio masks. This training target is masking-based as used in [16], [17], [39].

In [41], mapping-based and masking-based DNNs were studied for speaker separation, and it was reported that the two kinds of training targets have relative advantages in different conditions. We will compare these two training targets for speaker separation in reverberant conditions.
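As a sketch, the IRM of Eq. (4) and the masking-based reconstruction can be computed from the STFTs as follows; the small constant in the denominator is our own numerical-stability safeguard and is not specified in the paper.

```python
import numpy as np

def ideal_ratio_masks(S1, S2, Y, eps=1e-8):
    """Two-speaker IRM of Eq. (4): each mask compares an anechoic source
    magnitude against the magnitude of everything else in the mixture."""
    m1 = np.abs(S1) / (np.abs(S1) + np.abs(Y - S1) + eps)
    m2 = np.abs(S2) / (np.abs(S2) + np.abs(Y - S2) + eps)
    return m1, m2

def apply_mask(mask, Y):
    """Masking-based reconstruction: point-wise product with the mixture
    magnitude; the mixture phase is reused for overlap-add resynthesis."""
    return mask * np.abs(Y)
```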

C. Baseline one-stage networks

The one-stage system is illustrated in Fig. 1(a). In this study, RNNs with BLSTM are used due to their strong representational capacity, particularly for temporal patterns.

Fig. 1: Illustration of reverberant speaker separation. (a) Diagram of a baseline one-stage system and (b) diagram of the proposed two-stage system.

Our RNN consists of 4 BLSTM layers with 500 units in each layer (250 units per direction). An output layer is stacked on top of the BLSTM layers. The activation function for the units in the output layer is linear for the mapping-based target. Since the IRM ranges in [0, 1], we use the sigmoid function as the output layer activation for the masking-based target. During training, the network is unrolled for 100 time frames to perform truncated backpropagation through time (BPTT) [40] and update the weights. To predict one output frame, the BLSTM network is fed one feature frame, i.e., F̄0,0(.), without using neighboring frames, because the memory cells in the RNN contain the past and future contextual information.

Our BLSTM makes a prediction after observing the whole utterance, which is non-causal. We also evaluate a causal RNN with unidirectional LSTMs, which proceeds from the past to the current frame. To have a fair comparison with the BLSTM, we provide the LSTM with the feature vector F̄0,7(.), which includes 7 future frames. The same network hyperparameters used in the BLSTM are used in the LSTM.

To contrast feedforward and recurrent networks, we also generate a baseline with DFNs. A DFN layer has fewer trainable parameters than an LSTM layer with the same number of units, so for a fair comparison a DFN with more units is required. We use DFNs with 4 hidden layers, each consisting of 2000 rectified linear units (ReLU) [26]. Input features used in this case are F̄7,7(.).

In each network, the IRM or log-spectrograms are predicted frame by frame. The mean square error (MSE) loss function L is:

$L(D(m,:); \Theta) = \dfrac{1}{C}\sum_{c=1}^{C} \left(D(m,c) - G(\bar{F}(m))_c\right)^2$ (5)

where D is the desired target (i.e. the IRM or log-magnitude spectrogram), C is the number of frequency channels, Θ represents the DNN parameters, and G(.) represents the neural network operation. The Adam optimizer [22] with the learning rate of 3 × 10−4 is used to minimize L. In each network, the learning algorithm is run for 50 epochs, and Θ with the least MSE on the validation set is chosen and used during the test phase.
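The following PyTorch sketch illustrates a single-stage masking-based BLSTM consistent with the description above (4 BLSTM layers, 250 units per direction, sigmoid output, frame-level MSE, Adam with a learning rate of 3 × 10−4). The choice of PyTorch, the 161-bin output dimension (assuming a 320-sample frame at 16 kHz), and the toy training step are our assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SingleStageBLSTM(nn.Module):
    """Sketch of the baseline one-stage network: 4 BLSTM layers with
    250 units per direction, followed by a linear (mapping) or sigmoid
    (masking) output layer for both speakers' targets."""

    def __init__(self, in_dim=102, out_dim=2 * 161, masking=True):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, 250, num_layers=4,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(500, out_dim)
        self.masking = masking

    def forward(self, x):                       # x: (batch, frames, 102)
        h, _ = self.blstm(x)                    # h: (batch, frames, 500)
        y = self.out(h)
        return torch.sigmoid(y) if self.masking else y

# Toy training step: frame-level MSE (Eq. 5), Adam with lr = 3e-4,
# 100-frame segments for truncated BPTT. Data here is random placeholder.
model = SingleStageBLSTM()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
features = torch.randn(8, 100, 102)             # normalized mixture features
targets = torch.rand(8, 100, 2 * 161)           # IRMs for both speakers
loss = nn.functional.mse_loss(model(features), targets)
opt.zero_grad(); loss.backward(); opt.step()
```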

D. Two-stage networks

Zhao et al. [42] recently proposed a two-stage network to address the problem of noisy-reverberant speech separation. Their first stage is a masking-based DFN that separates additive noise from the reverberant speech signal. The denoised signal is fed into the second stage, a mapping-based DFN that performs dereverberation. Each DFN is trained separately, and then the two networks are trained jointly. During the test phase, the network performs speech denoising and dereverberation given only a noisy-reverberant signal. One potential drawback of Zhao et al.'s architecture is that the second stage is provided with only the output from the first stage and does not directly operate on the input signal. The first-stage output is itself distorted, and the discriminative power of the original acoustic features lost in the first stage would not be recovered by the second network. For speaker separation in reverberant conditions, we attempted to extend their approach so that the first stage is trained to separate the two reverberant speakers and the second stage is trained to dereverberate the target speaker. However, such an extension did not achieve satisfactory performance. Instead, we propose a different two-stage network for this problem.

The proposed two-stage system is depicted in Fig. 1(b). In the first stage, a DNN is trained to separate and dereverberate the target and interferer signals. This stage can be mapping-based or masking-based. The network output corresponding to the target speaker is converted to the log-magnitude spectrogram feature and normalized to zero mean and unit variance. This is concatenated to the mixture feature and used to train the second-stage network. The purpose of the second stage is further dereverberation of the initially separated and dereverberated target speaker. Because of the difficulty of combined separation and dereverberation, it is unlikely that a single DNN can achieve a high level of performance. With the output of the first stage, as well as the original mixture feature, the learning task of the second DNN is more focused. Hence the second stage is expected to attenuate or remove residual reverberation in the first-stage output. The second-stage network can also be mapping-based or masking-based DNN. The training targets for the target speaker are the same in both stages. Finally, the two networks are jointly trained for further fine tuning.

Four different two-stage networks can be constructed by combining the masking-based and mapping-based methods. Table I shows how the final output is calculated in each combination. We train each of the four two-stage architectures using DFNs, LSTMs, and BLSTMs. Each stage network is basically the same as its corresponding single-stage network, i.e., four LSTM, BLSTM, or DFN layers. Each stage network is first trained separately with the learning rate of 3 × 10−4, and then the two networks are jointly trained with the learning rate of 3 × 10−7.

TABLE I.

Calculation of the estimated target spectrogram in different two-stage networks. $G^{(1)}(\cdot)$ and $G^{(2)}(\cdot)$ denote the first- and second-stage DNNs. $\mu_{O_1}$ and $\sigma_{O_1}$ denote the normalization parameters for the output of the first stage, and $\mu_{O_2}$ and $\sigma_{O_2}$ those for the second stage.

Combination / DNN formula

Mapping+Mapping: $|\hat{S}_1| = \exp\left(G^{(2)}\left(\left[\frac{G^{(1)}(\bar{F}(m)) - \mu_{O_1}}{\sigma_{O_1}},\ \bar{F}(m)\right]\right) \times \sigma_{O_2} + \mu_{O_2}\right)$

Mapping+Masking: $|\hat{S}_1| = G^{(2)}\left(\left[\frac{G^{(1)}(\bar{F}(m)) - \mu_{O_1}}{\sigma_{O_1}},\ \bar{F}(m)\right]\right) \times |Y|$

Masking+Mapping: $|\hat{S}_1| = \exp\left(G^{(2)}\left(\left[\frac{\log\left(G^{(1)}(\bar{F}(m)) \times |Y|\right) - \mu_{O_1}}{\sigma_{O_1}},\ \bar{F}(m)\right]\right) \times \sigma_{O_2} + \mu_{O_2}\right)$

Masking+Masking: $|\hat{S}_1| = G^{(2)}\left(\left[\frac{\log\left(G^{(1)}(\bar{F}(m)) \times |Y|\right) - \mu_{O_1}}{\sigma_{O_1}},\ \bar{F}(m)\right]\right) \times |Y|$

In the two-stage BLSTM no neighboring frames are used in the input or the output. To train the two-stage LSTMs, F¯0,7(.) is fed to the first-stage network to predict the output for the current and the 3 future frames. This output is concatenated with F¯0,3(.) and passed to the second-stage to predict a single output frame. On the other hand, the two-stage DFNs use F¯7,7(.) in the first stage to predict 7 consecutive output frames, centered at the current frame. Then the second-stage network concatenates F¯3,3(.) with the first-stage output to predict a single output frame.
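To make the masking+masking row of Table I concrete, a minimal sketch of how the final target estimate could be assembled from the two stages is given below. The function signature and the epsilon inside the log are our assumptions, and the first-stage call is simplified to return only the target-speaker mask.

```python
import torch

def two_stage_masking_masking(G1, G2, feat, mix_mag, mu_o1, sigma_o1, eps=1e-8):
    """Masking+Masking combination from Table I (illustrative sketch).

    G1, G2  : first- and second-stage networks (callables)
    feat    : normalized mixture features, (frames, feat_dim)
    mix_mag : mixture magnitude spectrogram |Y|, (frames, freq)
    mu_o1, sigma_o1 : log-magnitude normalization statistics of the
                      first-stage target output, computed on the training set
    """
    mask1 = G1(feat)                                  # first-stage target mask
    stage1_mag = mask1 * mix_mag                      # initial separation
    stage1_feat = (torch.log(stage1_mag + eps) - mu_o1) / sigma_o1
    mask2 = G2(torch.cat([stage1_feat, feat], dim=-1))
    return mask2 * mix_mag                            # estimated |S1_hat|
```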

In order to evaluate the potential speech intelligibility benefits of separated speech, we use the Extended Short-Time Objective Intelligibility (ESTOI) [20] metric, which has been shown to correlate strongly with human intelligibility scores. ESTOI scores lie mostly between 0 and 1, and a higher score indicates better intelligibility. We use Perceptual Evaluation of Speech Quality (PESQ) [27] to evaluate the quality of the separated target signals. The PESQ score is a number between −0.5 and 4.5, and a higher score indicates better speech quality. We also use signal-to-distortion ratio improvement, or ΔSDR, which is another widely used speech separation evaluation metric [31]. The anechoic target is used as the reference signal in these evaluations.
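As an illustration of how these scores might be computed, the sketch below uses the third-party pystoi and pesq Python packages (an assumption; any compliant implementation would do) together with a simplified energy-ratio SDR, whereas the paper's ΔSDR follows the BSS_EVAL definition in [31]. The placeholder signals are random and serve only to show the calling conventions.

```python
import numpy as np
from pystoi import stoi       # assumed third-party package
from pesq import pesq         # assumed third-party package

fs = 16000
target = np.random.randn(3 * fs)                     # placeholder anechoic target
mixture = target + np.random.randn(3 * fs)           # placeholder mixture
separated = target + 0.1 * np.random.randn(3 * fs)   # placeholder network output

estoi = stoi(target, separated, fs, extended=True)   # ESTOI, mostly in [0, 1]
quality = pesq(fs, target, separated, 'wb')          # wideband PESQ

def sdr_db(ref, est):
    """Simplified energy-ratio SDR in dB; [31] defines the full BSS_EVAL version."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

delta_sdr = sdr_db(target, separated) - sdr_db(target, mixture)
```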

III. Evaluation results and comparisons

A. Experimental setup

The speech corpus used in this empirical study consists of 1440 IEEE sentences [19] uttered by a male and a female speaker. In the experiments, we arbitrarily designate the male speaker as the target and the female as the interferer. From this set, we randomly choose and set aside 120 female and 120 male utterances for testing, and the rest are used for training. To generate the training mixtures, one male utterance and one female utterance are randomly picked. In the case that the interfering signal is shorter, it is repeated until it covers the whole target sentence. We use the image method [2] to generate simulated RIRs in a room with the dimensions of (6.5, 8.5, 3) m, by placing a microphone at (3, 4, 1.5) m. The reverberation time (T60) is sampled from the continuous range [0.3, 1.0] s. The male speaker is randomly placed at a 1-m distance and the interferer at a 2-m distance from the microphone, at the same elevation. Then the reverberant male and female signals are mixed at a random target-to-interferer energy ratio (TIR) sampled from the continuous range [−12, 12] dB.
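The mixture generation described above can be sketched as follows; the use of scipy's fftconvolve, the truncation of the reverberant signals to the target length, and computing the TIR on the reverberant (rather than anechoic) signals are our illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_at_tir(target, interferer, h_target, h_interf, tir_db):
    """Create a reverberant two-talker mixture at a given TIR (sketch).

    target, interferer : anechoic utterances (the interferer is assumed to
                         have been repeated to cover the target already)
    h_target, h_interf : room impulse responses for the two sources
    tir_db             : target-to-interferer energy ratio in dB
    """
    rev_t = fftconvolve(target, h_target)[:len(target)]
    rev_i = fftconvolve(interferer, h_interf)[:len(target)]
    # Scale the interferer so the reverberant energy ratio equals tir_db.
    gain = np.sqrt(np.sum(rev_t ** 2) /
                   (np.sum(rev_i ** 2) * 10 ** (tir_db / 10)))
    return rev_t + gain * rev_i
```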

In total, 100,000 mixtures are generated for training and 1000 mixtures for validation. To train the single-stage networks, the entire training set is used. In the two-stage cases, the first stage is trained with half of the training set. Then, the second half is passed through the first stage and used to train the second stage. Finally, the joint training is done using the whole training set.

To generate simulated reverberant test mixtures, we use a different simulated room with the dimensions of (6, 8, 3) m, where the microphone position is set to (3.5, 2.5, 1.2) m. The target speaker is randomly placed at a 1-m distance and the interferer at a 2-m distance from the microphone. Note that since the test room is different from the training room, no RIRs are shared between the test and training sets. Test T60 is chosen from {0.3, 0.6, 0.9} s and test TIR from {−12, −6} dB. In each condition, the networks are tested using 2000 mixtures and average scores are reported.

In order to further evaluate the generalization of the systems to real room conditions, we also perform experiments using recorded RIRs. For this purpose, we use the recorded RIRs from [18], which consist of recordings from four rooms with T60 = {0.32, 0.47, 0.68, 0.89} s. Aside from T60, direct-to-reverberant energy ratio (DRR) is an important characteristic of a reverberant signal. In general, a lower DRR entails a more challenging processing condition. Table II shows DRRs for our simulated and recorded test rooms.

TABLE II.

Average DRR values (dB) in different room conditions.

                 Simulated room          Recorded room
T60 (s)          0.3    0.6    0.9       0.32   0.47   0.68   0.89
Target DRR       3.3   −1.4   −3.7       6.1    5.3    8.8    6.1
Interferer DRR  −2.7   −7.4   −9.7       –      –      –      –

B. Single-stage reverberant speaker separation

To provide a baseline for reverberant speaker separation, we first present results in anechoic conditions. Table III shows ESTOI, PESQ, and ΔSDR scores with DNNs trained and tested in anechoic conditions. The results indicate that the masking-based systems perform better than the mapping-based systems. In addition, BLSTM and LSTM outperform DFN, showing that recurrent networks can better separate the speakers. A masking-based BLSTM achieves the best intelligibility and quality scores.

Table III.

ESTOI, PESQ, and ΔSDR scores for speaker separation in anechoic conditions. Single-stage networks are trained with anechoic data. Boldface highlights the best result in each condition.

ESTOI (%) PESQ ΔSDR (dB)
TIR (dB) −12 −6 Average −12 −6 Average −12 −6 Average
Unprocessed 24.6 36.4 30.5 1.35 1.58 1.46
DFN Mapping 62.4 73.6 68.0 2.35 2.65 2.50 14.15 10.77 12.46
Masking 63.7 76.2 69.9 2.40 2.75 2.57 15.83 12.94 14.28
LSTM Mapping 67.9 77.3 72.6 2.47 2.76 2.61 14.60 11.14 12.87
Masking 69.1 79.7 74.4 2.52 2.86 2.69 16.26 13.32 14.79
BLSTM Mapping 71.6 79.9 75.7 2.61 2.87 2.74 15.24 11.70 13.47
Masking 72.0 81.5 76.7 2.62 2.94 2.78 16.77 13.58 15.17

We train single-stage mapping and masking-based DFNs, LSTMs, and BLSTMs to perform speaker separation and speech dereverberation. Objective scores for simulated and recorded RIR conditions for TIR of −12 dB are presented in Figure 2. Similar to anechoic conditions, BLSTMs outperform LSTMs and DFNs, and T-F masking outperforms spectral mapping. These observations are consistent across simulated and real room conditions, and the amounts of improvement are also comparable between simulated and real room conditions. Note that the systems are trained using only simulated RIRs, and the substantial improvements in the real rooms suggest that the trained DNNs are capable of generalizing to different reverberant conditions.

Fig. 2: Separation performance in reverberant environments for single-stage mapping-based and masking-based DFN, LSTM, and BLSTM. Test results are shown for different T60 values with TIR = −12 dB. (a) ESTOI scores in simulated RIR conditions, (b) ESTOI scores in recorded RIR conditions, (c) PESQ scores in simulated RIR conditions, and (d) PESQ scores in recorded RIR conditions.

From Fig. 2, we observe that the scores decrease with the increase of T60. This trend is most evident when comparing with the performance in anechoic conditions in Table III. For example, the masking-based BLSTM achieves 47.4% ESTOI improvement in the anechoic condition with TIR = −12 dB, and this improvement is reduced to 36.3% at T60 = 0.3 s with the same TIR. This indicates that separation in reverberant conditions is in general more challenging.

C. Two-stage reverberant speaker separation

We train the four combinations of masking-based and mapping-based networks shown in Table I. Table IV gives ESTOI scores using two-stage DNNs in simulated reverberant test conditions. The results from single-stage networks are also included in this table for reference. As seen in the table, two-stage networks in general outperform single-stage networks. Our experiments demonstrate that cascading two masking-based networks is the best combination. Among DFNs, LSTMs, and BLSTMs, we observe that the BLSTMs benefit the least from a two-stage system. On the other hand, the combination of two masking-based BLSTMs achieves the best performance among all the DNNs evaluated. We also observe that using a mapping-based network in either the first or the second stage does not perform as well as a masking-based network. Table VI shows the corresponding PESQ results, which exhibit a trend similar to that of the ESTOI results. Again, the best results are achieved by the masking+masking BLSTM system.

TABLE IV.

ESTOI (%) scores for different two-stage and single-stage DFNs, LSTMs and BLSTMs in simulated reverberant conditions. In each two-stage DNN, * indicates that the score is significantly better than the masking-based single-stage DNN baseline score (with the significance level of p < 0.0005).

T60 (s) 0.3 0.6 0.9
TIR (dB) −12 −6 −12 −6 −12 −6 Average
Unprocessed 21.7 33.3 14.2 23.5 10.0 17.8 20.1
DFN Single-stage Mapping 41.9 54.9 32.3 46.4 25.5 38.5 39.9
Single-stage Masking 49.5 62.8 37.6 51.5 29.6 43.1 45.7
Mapping+Mapping 44.8 57.7 34.1 47.6 26.6 39.6 41.7
Mapping+Masking 49.4 62.7 37.1 50.9 29.0 42.2 45.2
Masking+Mapping 49.6 62.0 37.5 51.2 29.2 42.9 45.4
Masking+Masking 53.1* 66.2* 40.0* 53.9* 31.3* 45.0* 48.2*
LSTM Single-stage Mapping 48.2 60.0 37.2 50.5 29.3 42.5 44.6
Single-stage Masking 54.1 66.3 41.5 55.2 32.8 46.7 49.4
Mapping+Mapping 49.5 60.8 37.5 50.6 29.2 42.4 45.0
Mapping+Masking 53.1 65.1 40.4 53.6 31.9 45.1 48.2
Masking+Mapping 53.2 64.4 40.2 53.2 31.4 44.6 47.8
Masking+Masking 55.5* 67.6* 42.2* 55.6* 33.3* 46.8* 50.2*
BLSTM Single-stage Mapping 52.9 64.5 41.6 54.8 32.8 46.8 48.9
Single-stage Masking 56.1 68.0 44.9 58.4 35.4 50.1 52.1
Mapping+Mapping 53.7 64.3 40.5 54.6 30.4 45.4 48.1
Mapping+Masking 54.2 65.4 41.8 55.4 32.2 46.9 49.3
Masking+Mapping 57.4* 68.6* 44.8 58.5 34.5 49.4 52.2
Masking+Masking 58.0* 69.8* 45.5* 59.3* 36.0* 50.8* 53.2*

TABLE VI.

PESQ scores for different two-stage and single-stage DFNs, LSTMs and BLSTMs in simulated reverberant conditions.

T60 (s) 0.3 0.6 0.9
TIR (dB) −12 −6 −12 −6 −12 −6 Average
Unprocessed 1.40 1.51 1.49 1.51 1.56 1.56 1.50
DFN Single-stage Mapping 1.77 2.07 1.59 1.91 1.46 1.77 1.76
Single-stage Masking 1.97 2.29 1.76 2.07 1.62 1.91 1.94
Mapping+Mapping 1.84 2.15 1.65 1.95 1.52 1.80 1.82
Mapping+Masking 1.95 2.28 1.75 2.04 1.62 1.89 1.92
Masking+Mapping 1.96 2.27 1.73 2.04 1.58 1.87 1.91
Masking+Masking 2.06* 2.40* 1.82* 2.14* 1.67* 1.97* 2.01*
LSTM Single-stage Mapping 1.97 2.23 1.77 2.04 1.63 1.90 1.92
Single-stage Masking 2.12 2.42 1.88 2.18 1.73 2.02 2.06
Mapping+Mapping 1.97 2.23 1.76 2.04 1.63 1.90 1.92
Mapping+Masking 2.08 2.37 1.85 2.13 1.70 1.97 2.02
Masking+Mapping 2.05 2.33 1.81 2.09 1.66 1.94 1.98
Masking+Masking 2.14 2.46* 1.88 2.19 1.72 2.02 2.07
BLSTM Single-stage Mapping 2.05 2.31 1.85 2.13 1.71 1.99 2.01
Single-stage Masking 2.18 2.47 1.97 2.26 1.81 2.10 2.13
Mapping+Mapping 2.14 2.42 1.89 2.18 1.68 1.99 2.05
Mapping+Masking 2.17 2.48 1.93 2.23 1.74 2.06 2.10
Masking+Mapping 2.22 2.48 1.94 2.24 1.76 2.06 2.12
Masking+Masking 2.26* 2.57 2.00* 2.31* 1.82* 2.14* 2.18*

As mentioned earlier, in a two-stage system, two single-stage networks are first trained separately and then jointly. The results indicate that the two-stage masking+mapping system outperforms single-stage mapping. To see how much improvement is due to joint training, Figure 3 compares a masking+mapping DFN with and without joint training as well as a single-stage DFN. A simulated room condition with T60 = 0.9 s and TIR = −6 dB is used for this comparison. As seen in the figure, the performance gain of the two-stage network is mostly because of joint training. We have also observed this in other DNN architectures.

Fig. 3: A single-stage mapping-based DFN is compared with a two-stage masking+mapping DFN with and without joint training. The test condition is a simulated room with T60 = 0.9 s and TIR = −6 dB.

Fig. 4 illustrates IRM estimation for a reverberant mixture of a male and a female utterance. Broadly speaking, both the single-stage masking and two-stage masking+masking networks are able to estimate the IRM well. On the other hand, the two-stage network is better at recovering finer spectrogram structures in the IRM, and this explains the superior performance of the two-stage approach.

Fig. 4: IRM prediction for reverberant speaker separation using BLSTMs. The test condition is a simulated room with T60 = 0.3 s and TIR = −6 dB. Each panel is indicated by its corresponding label.

Next, we present separation results with recorded RIRs. It is worth emphasizing that no recorded RIR is used in training. Table VII shows ESTOI results for real room conditions. Similar to simulated reverberant conditions, the two-stage BLSTM system achieves the best results. Again, the masking+masking BLSTM system outperforms other DNN architectures. In Table VIII, we present PESQ results for these conditions. The trend of the PESQ scores is similar to that of ESTOI scores in Table VII.

TABLE VII.

ESTOI (%) scores for different two-stage and single-stage DFNs, LSTMs and BLSTMs in recorded reverberant conditions.

T60 (s) 0.32 0.47 0.68 0.89
TIR (dB) −12 −6 −12 −6 −12 −6 −12 −6 Average
Unprocessed 22.2 33.0 17.4 27.2 20.9 32.1 16.3 26.7 24.5
DFN Single-stage Mapping 39.4 50.3 33.5 44.7 38.2 49.7 28.6 39.7 40.5
Single-stage Masking 47.5 59.8 40.0 52.2 46.3 59.1 35.5 48.2 48.6
Mapping+Mapping 41.3 51.8 35.7 46.6 41.4 52.9 30.8 41.6 42.8
Mapping+Masking 47.2 59.2 39.8 51.7 46.3 59.0 35.1 47.6 48.2
Masking+Mapping 44.9 55.0 39.0 49.7 45.3 56.3 33.1 44.0 45.9
Masking+Masking 50.1* 62.0* 42.3* 54.1* 49.4* 61.8* 36.8* 49.5* 50.7*
LSTM Single-stage Mapping 44.1 53.8 38.2 49.0 44.6 55.3 33.1 43.8 45.2
Single-stage Masking 50.9 62.6 43.7 55.6 50.4 62.7 38.3 51.1 51.9
Mapping+Mapping 44.2 53.8 38.6 49.3 45.1 56.0 33.6 44.3 45.6
Mapping+Masking 49.9 61.2 42.8 54.5 49.6 61.9 38.0 50.4 51.0
Masking+Mapping 47.2 57.0 41.6 52.2 47.9 58.6 35.5 46.4 48.3
Masking+Masking 51.7 63.3 44.5 56.3 51.3* 63.6* 39.0 51.5 52.6
BLSTM Single-stage Mapping 48.1 57.8 42.0 52.7 47.5 58.6 36.1 47.3 48.8
Single-stage Masking 52.5 64.0 45.2 57.0 50.8 63.1 40.1 52.9 53.2
Mapping+Mapping 49.7 59.8 43.1 54.1 49.3 60.5 35.6 49.1 50.1
Mapping+Masking 50.6 62.1 43.4 55.0 50.1 61.9 37.8 52.1 51.6
Masking+Mapping 51.1 61.2 44.7 55.8 50.8 61.9 37.6 49.4 51.6
Masking+Masking 53.2 64.7* 46.0 57.8* 52.8* 64.9* 39.9 53.0 54.0*

TABLE VIII.

PESQ scores for different two-stage and single-stage DFNs, LSTMs and BLSTMs in recorded reverberant conditions.

T60 (s) 0.32 0.47 0.68 0.89
TIR (dB) −12 −6 −12 −6 −12 −6 −12 −6 Average
Unprocessed 1.37 1.55 1.35 1.46 1.37 1.52 1.45 1.54 1.45
DFN Single-stage Mapping 1.77 2.04 1.69 1.98 1.68 1.95 1.62 1.87 1.82
Single-stage Masking 1.98 2.31 1.86 2.18 1.87 2.20 1.76 2.05 2.03
Mapping+Mapping 1.82 2.07 1.76 2.03 1.76 2.04 1.67 1.92 1.88
Mapping+Masking 1.99 2.30 1.87 2.17 1.88 2.20 1.78 2.05 2.03
Masking+Mapping 1.90 2.16 1.83 2.12 1.86 2.14 1.76 2.00 1.97
Masking+Masking 2.06* 2.38* 1.93* 2.25* 1.96* 2.30* 1.82* 2.11* 2.10*
LSTM Single-stage Mapping 1.91 2.14 1.87 2.12 1.90 2.14 1.80 2.03 1.99
Single-stage Masking 2.12 2.42 2.01 2.32 2.04 2.35 1.91 2.19 2.17
Mapping+Mapping 1.89 2.11 1.85 2.11 1.89 2.14 1.80 2.01 1.97
Mapping+Masking 2.08 2.37 1.97 2.27 2.01 2.31 1.89 2.16 2.13
Masking+Mapping 1.96 2.20 1.92 2.19 1.95 2.21 1.84 2.07 2.04
Masking+Masking 2.14 2.45 2.03 2.35* 2.06 2.38* 1.93 2.22 2.19
BLSTM Single-stage Mapping 2.01 2.22 1.96 2.21 1.97 2.21 1.86 2.09 2.07
Single-stage Masking 2.19 2.48 2.08 2.37 2.09 2.38 1.98 2.26 2.23
Mapping+Mapping 2.03 2.27 1.97 2.24 1.97 2.27 1.86 2.08 2.09
Mapping+Masking 2.16 2.49* 2.05 2.39* 2.10 2.42* 1.95 2.24 2.22
Masking+Mapping 2.13 2.36 2.05 2.32 2.07 2.34 1.94 2.18 2.17
Masking+Masking 2.23* 2.53* 2.12* 2.43* 2.15* 2.46* 2.01 2.29* 2.28*

D. Comparisons with talker-independent and target-dependent separation

Our model is trained and tested using a fixed pair of speakers whose utterances have been used during training. Such speaker separation is talker-dependent. Deep learning models have recently been developed to perform talker-independent speaker separation, i.e., test speakers can be different from training speakers. One prominent method is utterance-level permutation invariant training (uPIT) [23]. This algorithm computes the training loss for each possible assignment of network outputs to speakers over the whole utterance and then optimizes the network using the minimum of these permutation losses. To assess whether our talker-dependent separation yields the expected improvement over talker-independent separation, we train a uPIT model and compare it with our two-stage network. To this end, we use the WSJ0 corpus [7] to generate reverberant mixtures for uPIT. All signals are downsampled to 16 kHz and the frame size is set to 20 ms with a frame shift of 10 ms. Experiments are performed as described in Section III-A. The uPIT network with BLSTM is optimized to estimate the IRM for each of the two anechoic utterances by minimizing the utterance-level loss.
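For reference, a minimal sketch of the utterance-level PIT idea for two speakers is given below; it computes a masked-magnitude MSE for both speaker assignments and keeps the smaller utterance-level loss. The tensor layout and the loss on masked magnitudes are our assumptions and may differ from the uPIT configuration actually trained here.

```python
import torch

def upit_loss(est_masks, irm, mix_mag):
    """Utterance-level PIT loss for two speakers (sketch of the idea in [23]).

    est_masks, irm : (batch, frames, freq, 2) estimated and ideal masks
    mix_mag        : (batch, frames, freq) mixture magnitude
    The loss is evaluated for both speaker orderings and the smaller
    utterance-level error is kept for each mixture in the batch.
    """
    est = est_masks * mix_mag.unsqueeze(-1)
    ref = irm * mix_mag.unsqueeze(-1)
    loss_a = ((est - ref) ** 2).mean(dim=(1, 2, 3))            # original order
    loss_b = ((est - ref.flip(-1)) ** 2).mean(dim=(1, 2, 3))   # speakers swapped
    return torch.minimum(loss_a, loss_b).mean()
```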

A target-dependent speaker separation model aims to separate a trained target speaker from an open set of interfering speakers [6], [41]. To train a target-dependent model we use the IEEE male speaker as the target and WSJ0 speakers as interferers. A BLSTM network is trained to predict the IRM for the anechoic target utterances via the loss function in (5).

The results are shown in Fig. 5. As expected, in both simulated and recorded RIR conditions, our two-stage talker-dependent model outperforms the target-dependent model, which in turn outperforms uPIT. Furthermore, uPIT improves both ESTOI and PESQ scores over the unprocessed mixtures. All these improvements are statistically significant (p < 0.0005). These consistent results suggest that speaker-specific information, when available, contributes to speaker separation performance. The comparative results between the talker-dependent and target-dependent models further suggest that interferer information also plays a role in separation performance.

Fig. 5: Comparison of the proposed two-stage network with target-dependent and talker-independent speaker separation. All networks use BLSTM and test TIR is −12 dB. (a) ESTOI scores in simulated RIR conditions, (b) ESTOI scores in recorded RIR conditions, (c) PESQ scores in simulated RIR conditions, (d) PESQ scores in recorded RIR conditions, (e) ΔSDR scores in simulated RIR conditions, and (f) ΔSDR scores in recorded RIR conditions. Error bars depict the standard deviation.

Our evaluation so far is on a pair of male-female speakers. We expect a similar pattern of results for same-gender pairs, although the results are expected to be a little worse. To verify this, we choose a new pair of male speakers, both uttering the IEEE corpus, with one of them designated as the target speaker. This evaluation is conducted in simulated reverberant conditions at three T60s (0.3, 0.6, 0.9 s) and recorded reverberant rooms at four T60s (0.32, 0.47, 0.68, 0.89 s) with one TIR (−6 dB). The same-gender results and comparisons are presented in Table V. Like the male-female results, talker-dependent speaker separation outperforms target-dependent separation, which in turn yields better results than talker-independent separation. Furthermore, the two-stage network performs better than the single-stage network.

TABLE V.

Separation results and comparisons for a pair of male-male speakers. Results are shown at TIR = −6 dB, averaged over all T60 conditions.

Metric ESTOI (%) PESQ ΔSDR (dB)
Unprocessed 23.9 1.21 0.00
Talker-independent 36.4 1.56 3.19
Target-dependent 39.2 1.77 3.41
One-stage talker-dependent 49.5 2.00 5.26
Two-stage talker-dependent 50.1* 2.02 5.40*

IV. Concluding remarks

In this paper, we have proposed two-stage deep neural networks for speaker separation in reverberant conditions. We have compared the performance of BLSTMs, LSTMs, and DFNs, and our experimental results show that recurrent networks outperform feedforward networks in a wide range of conditions, with BLSTMs performing the best. We have also shown that masking-based separation outperforms mapping-based separation. Our empirical study shows that talker-dependent speaker separation in reverberant conditions yields better results than target-dependent models, which in turn perform better than talker-independent separation. This observation is expected, as talker-dependent models operate in more constrained conditions. To our knowledge, this is the first study to address monaural speaker separation in reverberant conditions using RNNs. In the future, we plan to extend the current system to speech separation conditions with both background noise and interfering speakers.

Acknowledgments

This research was supported in part by an NIDCD grant (R01DC012048) and the Ohio Supercomputer Center. We thank Yuzhou Liu for providing the uPIT code and Eric Johnson for help with statistical analysis.

References

  • [1] Allen JB, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform," IEEE Trans. Audio, Speech, Lang. Process., vol. 25, pp. 235–238, 1977.
  • [2] Allen JB and Berkley DA, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943–950, 1979.
  • [3] Culling JF, Hodder KI, and Toh CY, "Effects of reverberation on perceptual segregation of competing voices," J. Acoust. Soc. Amer., vol. 114, pp. 2871–2876, 2003.
  • [4] Delfarah M and Wang DL, "Features for masking-based monaural speech separation in reverberant conditions," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 1085–1094, 2017.
  • [5] ——, "Recurrent neural networks for cochannel speech separation in reverberant environments," in Proc. ICASSP, 2018, pp. 5404–5408.
  • [6] Du J, Tu Y, Xu Y, Dai L, and Lee C-H, "Speech separation of a target speaker based on deep neural networks," in Proc. ICSP, 2014, pp. 473–477.
  • [7] Garofolo J, Graff D, Paul D, and Pallett D, "CSR-I (WSJ0) complete LDC93S6A," Philadelphia: Linguistic Data Consortium, 1993.
  • [8] George EL, Goverts ST, Festen JM, and Houtgast T, "Measuring the effects of reverberation and noise on sentence intelligibility for hearing-impaired listeners," J. Speech Lang. Hear. Res., vol. 53, pp. 1429–1439, 2010.
  • [9] Goehring T, Bolner F, Monaghan JJ, van Dijk B, Zarowski A, and Bleeck S, "Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users," Hearing Research, vol. 344, pp. 183–194, 2017.
  • [10] Grais EM, Roma G, Simpson AJ, and Plumbley MD, "Two-stage single-channel audio source separation using deep neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 1773–1783, 2017.
  • [11] Healy EW, Delfarah M, Vasko JL, Carter BL, and Wang DL, "An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker," J. Acoust. Soc. Amer., vol. 141, pp. 4230–4239, 2017.
  • [12] Healy EW, Yoho SE, Wang Y, and Wang DL, "An algorithm to improve speech recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Amer., vol. 134, pp. 3029–3038, 2013.
  • [13] Helfer KS and Wilber LA, "Hearing loss, aging, and speech perception in reverberation and noise," J. Speech Lang. Hear. Res., vol. 33, pp. 149–155, 1990.
  • [14] Hershey JR, Chen Z, Le Roux J, and Watanabe S, "Deep clustering: Discriminative embeddings for segmentation and separation," in Proc. ICASSP, 2016, pp. 31–35.
  • [15] Hochreiter S and Schmidhuber J, "Long short-term memory," Neural Comput., vol. 9, pp. 1735–1780, 1997.
  • [16] Huang P-S, Kim M, Hasegawa-Johnson M, and Smaragdis P, "Deep learning for monaural speech separation," in Proc. ICASSP, 2014, pp. 1562–1566.
  • [17] ——, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, pp. 2136–2147, 2015.
  • [18] Hummersone C, Mason R, and Brookes T, "Dynamic precedence effect modeling for source separation in reverberant environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 1867–1871, 2010.
  • [19] IEEE, "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust., vol. 17, pp. 225–246, 1969.
  • [20] Jensen J and Taal CH, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE Trans. Audio, Speech, Lang. Process., vol. 24, pp. 2009–2022, 2016.
  • [21] Kim C and Stern RM, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, pp. 1315–1329, 2016.
  • [22] Kingma D and Ba J, "Adam: A method for stochastic optimization," in Proc. ICML, 2015.
  • [23] Kolbaek M, Yu D, Tan Z-H, and Jensen J, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE Trans. Audio, Speech, Lang. Process., vol. 25, pp. 1901–1913, 2017.
  • [24] Loizou PC, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC Press, 2013.
  • [25] Luo Y and Mesgarani N, "TasNet: Surpassing ideal time-frequency masking for speech separation," arXiv preprint arXiv:1809.07454, 2018.
  • [26] Nair V and Hinton GE, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010, pp. 807–814.
  • [27] Rix AW, Beerends JG, Hollier MP, and Hekstra AP, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, 2001, pp. 749–752.
  • [28] Sayles M and Winter IM, "Reverberation challenges the temporal representation of the pitch of complex sounds," Neuron, vol. 58, pp. 789–801, 2008.
  • [29] Shao Y and Wang DL, "Robust speaker identification using auditory features and computational auditory scene analysis," in Proc. ICASSP, 2008, pp. 1589–1592.
  • [30] Shi Z, Lin H, Liu L, Liu R, and Han J, "FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks," arXiv preprint arXiv:1902.04891, 2019.
  • [31] Vincent E, Gribonval R, and Févotte C, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, pp. 1462–1469, 2006.
  • [32] Wang DL and Brown GJ, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ: Wiley-IEEE Press, 2006.
  • [33] Wang DL and Chen J, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, pp. 1702–1726, 2018.
  • [34] Wang Y, Du J, Dai L-R, and Lee C-H, "A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, pp. 1535–1546, 2017.
  • [35] Wang Y, Narayanan A, and Wang DL, "On training targets for supervised speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 22, pp. 1849–1858, 2014.
  • [36] Wang Y and Wang DL, "Towards scaling up classification-based speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, pp. 1381–1390, 2013.
  • [37] Wang Z-Q, Le Roux J, and Hershey JR, "Alternative objective functions for deep clustering," in Proc. ICASSP, 2018, pp. 686–690.
  • [38] Wang Z-Q and Wang DL, "Combining spectral and spatial features for deep learning based blind speaker separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 27, pp. 457–468, 2018.
  • [39] Weninger F, Hershey JR, Le Roux J, and Schuller B, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc. GlobalSIP, 2014, pp. 577–581.
  • [40] Williams RJ and Peng J, "An efficient gradient-based algorithm for online training of recurrent network trajectories," Neural Comput., vol. 2, pp. 490–501, 1990.
  • [41] Zhang X-L and Wang DL, "A deep ensemble learning method for monaural speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, pp. 967–977, 2016.
  • [42] Zhao Y, Wang Z-Q, and Wang DL, "Two-stage deep learning for noisy-reverberant speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, pp. 53–62, 2019.
