Author manuscript; available in PMC: 2020 Nov 10.
Published in final edited form as: IEEE/ACM Trans Audio Speech Lang Process. 2020 Jul 8;28:2109–2118. doi: 10.1109/taslp.2020.3007779

Causal Deep CASA for Monaural Talker-Independent Speaker Separation

Yuzhou Liu 1, DeLiang Wang 2
PMCID: PMC7654633  NIHMSID: NIHMS1642365  PMID: 33178880

Abstract

Talker-independent monaural speaker separation aims to separate concurrent speakers from a single-microphone recording. Inspired by human auditory scene analysis (ASA) mechanisms, a two-stage deep CASA approach has been proposed recently to address this problem, which achieves state-of-the-art results in separating mixtures of two or three speakers. A main limitation of deep CASA is that it is a non-causal system, while many speech processing applications, e.g., telecommunication and hearing prosthesis, require causal processing. In this study, we propose a causal version of deep CASA to address this limitation. First, we modify temporal connections, normalization and clustering algorithms in deep CASA so that no future information is used throughout the deep network. We then train a C-speaker (C ≥ 2) deep CASA system in a speaker-number-independent fashion, generalizable to speech mixtures with up to C speakers without the prior knowledge about the speaker number. Experimental results show that causal deep CASA achieves excellent speaker separation performance with known or unknown speaker numbers.

Keywords: Monaural speaker separation, talker-independent speaker separation, deep CASA, causal processing

I. Introduction

INTERFERENCE from competing speakers is considered a major challenge in speech communication, automatic speech processing systems, and hearing prosthesis. Based on deep learning, many talker-independent monaural speaker separation algorithms have been proposed in recent years to address this problem. Two main approaches are deep clustering (DC) [6] and permutation invariant training (PIT) [13]. Deep clustering learns an embedding vector for each time-frequency (T-F) unit of the mixture. Clustering the embedding vectors results in binary T-F masks, which can be used to separate the speakers from the mixture. In PIT, each output layer in a deep neural network (DNN) is associated with one speaker in the mixture. During training, PIT examines the losses with respect to all possible output-speaker permutations, and optimizes the DNN using the minimum loss. Based on the types of output-speaker pairing, PIT can be categorized into frame-level PIT (tPIT), where the pairings can change frame by frame, and utterance-level PIT (uPIT), where the pairing is fixed throughout each training utterance. Many extensions have been proposed recently for DC and PIT, including [15], [19], [20], [22], [27], [28]. Inspired by research in computational auditory scene analysis (CASA) [25], we recently proposed deep CASA [18] which breaks down the speaker separation task into two stages, i.e., tPIT based simultaneous grouping and clustering based sequential grouping. Deep CASA achieves frame-level separation and speaker tracking in turn. Compared to one-stage PIT or DC which optimizes the two objectives at the same time, deep CASA substantially mitigates the mistakes in speaker tracking, and leads to improvements in speaker separation performance.

Although deep CASA produces the state-of-the-art speaker separation results, it has a major limitation from the viewpoint of real-world deployment: it is non-causal. Causal processing is a major requirement in many real-time speech applications, including telecommunication and hearing aids. For example, mobile communication involves real-time interaction and is sensitive to processing delay. For hearing prosthesis, a processing delay longer than 10 ms would create a misalignment between real and processed signals, hampering speech perception [5]. The deep CASA system [18] utilizes future information as long as 9 seconds for separation and speaker tracking, making it unsuitable for these applications. It is therefore important to develop a causal version of deep CASA.

Models based on uPIT can be easily extended to the causal version if causal DNNs are utilized [13], [20]. However, lacking future information for speaker tracking [1], causal uPIT significantly underperforms non-causal uPIT in a variety of settings, as demonstrated in [1], [13], [20]. On the other hand, even with causal DNNs, clustering based methods like DC [6] and deep attractor networks (DAN) [19] struggle to operate causally, as the centroids of clusters are hard to estimate in an online fashion. Recently, researchers have started to incorporate uPIT as a parallel training target for DC based systems, and to use the spectral outputs from uPIT during inference [1], [26]-[28]. In this way, speaker separation can be achieved causally without a clustering step [1].

Another challenge for real-world deployment is that the number of concurrent speakers is usually unknown beforehand. Again, DC and DAN fail to operate properly in such a scenario, as the speaker number is needed for clustering. To tackle this problem, Higuchi et al. [7] perform offline source counting by computing the rank of the covariance matrix of the embedding vectors. An accuracy of 67.3% is achieved for counting two- and three-speaker mixtures. The non-causal setup and mediocre performance in [7] make the study far from practical utility. On the other hand, a C-output uPIT model can be directly applied to speech mixtures with up to C speakers, without prior knowledge of the speaker number, as some of the outputs can be trained to generate silence as a placeholder [13]. Another direction for speaker-number-independent separation is to recursively remove one speaker at a time from the mixture [12], [23]. In [23], a one-and-rest permutation invariant training (OR-PIT) algorithm is proposed to train such a network. A binary classifier is trained to produce the stopping signal for the system. Satisfactory results have been achieved on two- and three-speaker mixtures. However, it should be noted that the stopping signal generator needs the entire utterance as input, and cannot be easily extended to causal processing.

This study aims to make deep CASA causal. First, all non-causal connections and normalization are replaced with their causal versions throughout the deep CASA network. We then propose two causal clustering algorithms for the sequential grouping stage, both matching the performance of non-causal clustering. Finally, we fine-tune a three-speaker deep CASA system with two-speaker mixtures. The proposed causal deep CASA algorithm achieves excellent results on both two- and three-speaker mixtures, with no knowledge about the speaker number.

The rest of the paper is organized as follows. Section II reviews the non-causal deep CASA system. Causal processing is introduced in Section III. Section IV presents experimental results and comparisons. Concluding remarks are given in Section V.

II. A Deep CASA Approach to Monaural Speaker Separation

Monaural speaker separation aims to separate C speakers $x_c(n)$, c = 1, …, C, from a single-microphone recording of a speech mixture y(n), where $y(n) = \sum_{c=1}^{C} x_c(n)$ and n indexes time. In this section, we review two versions of deep CASA in [18], namely a two-speaker version and a multi-speaker version. The systems are presented in two parts: simultaneous grouping and sequential grouping.

A. Simultaneous Grouping

Given the complex short-time Fourier transform (STFT) of the mixture Y(t, f), where t and f index frame and frequency, simultaneous grouping is performed to separate the C speakers at the frame level. C outputs, $\hat{X}_c(t,f)$ (c = 1, …, C), are generated to estimate the complex STFT of the C speakers. The training of simultaneous grouping follows the tPIT criterion, where the frame-level output-speaker pairing is chosen as the pairing that minimizes the $\ell_1$ loss function over all possible speaker permutations. The outputs are then organized into C streams using the resulting tPIT pairings:

$$\hat{X}_c(t,f) \rightarrow \hat{X}_{o_c}(t,f), \quad c = 1, \ldots, C$$ (1)

Here $o_c$ denotes the mapping from speaker outputs to speaker streams, which can change across frames. Next, C time-domain signals, $\hat{x}_{o_c}(n)$ (c = 1, …, C), are generated by applying inverse STFT to the organized streams. Finally, a signal-to-noise ratio (SNR) objective $J^{\text{tPIT-SNR}}$ is used to tune the network:

$$J^{\text{tPIT-SNR}} = \sum_{c=1}^{C} 10 \log \frac{\sum_n x_c(n)^2}{\sum_n \left[x_c(n) - \hat{x}_{o_c}(n)\right]^2}$$ (2)
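The frame-level pairing of Eq. (1) and the SNR objective of Eq. (2) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the l1 loss is assumed to be taken over real and imaginary parts of the complex STFT, and the logarithm is assumed to be base 10 (decibels).

```python
import itertools
import numpy as np

def _l1(a, b):
    # Assumed l1 distance on real and imaginary parts, summed over frequency.
    d = a - b
    return (np.abs(d.real) + np.abs(d.imag)).sum(axis=-1)

def tpit_organize(est, ref):
    """Frame-level PIT on complex STFTs of shape (C, T, F).

    Returns the organized estimates (C, T, F) and the per-frame mapping o of
    shape (T, C), where o[t, c] is the output index assigned to speaker c.
    """
    C, T, _ = est.shape
    perms = list(itertools.permutations(range(C)))
    # loss[p, t]: total l1 loss at frame t under permutation perms[p]
    loss = np.stack([
        sum(_l1(est[p[c]], ref[c]) for c in range(C)) for p in perms
    ])
    o = np.array(perms)[loss.argmin(axis=0)]          # best pairing per frame
    organized = np.stack([est[o[:, c], np.arange(T)] for c in range(C)])
    return organized, o

def snr_objective(x, x_hat):
    """Sum over speakers of 10*log10(signal energy / error energy).

    x, x_hat: time-domain signals of shape (C, N).
    """
    num = (x ** 2).sum(axis=1)
    den = ((x - x_hat) ** 2).sum(axis=1)
    return float((10 * np.log10(num / den)).sum())
```

Enumerating all C! permutations per frame is only practical for small C, which matches the two- and three-speaker settings discussed here.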

A Dense-UNet architecture is used for simultaneous grouping, as shown in Fig. 1. It consists of a sequence of upsampling layers, downsampling layers, and dense convolutional blocks, and can be divided into two halves. In the first half, an alternation of dense convolutional blocks and downsampling layers projects the input feature map into a high level of abstraction. Dense blocks and upsampling layers are then alternated in the second half to restore the encoded features back to the original resolution. The dense blocks at the same hierarchical level in the two halves are linked with skip connections. The output layers in Dense-UNet estimate complex T-F masks for the C speakers, which are then multiplied with Y(t, f) to generate source estimates $\hat{X}_c(t,f)$. Other details, including the number of layers, downsampling, upsampling, dense convolutional blocks, and frequency mapping layers, follow those in [18].

Fig. 1.

Diagram of the Dense-UNet in simultaneous grouping. Gray, ‘DS’ and ‘US’ blocks denote dense convolutional blocks, downsampling layers and upsampling layers, respectively. Dense convolutional blocks at the same level are linked with skip connections. The inputs, masks and outputs are defined in the complex STFT domain.

B. Sequential Grouping

The sequential grouping stage tracks all frame-level spectral estimates $\hat{X}_c(t,f)$, and assigns them to the C speakers. The mixture spectrogram and the C spectral estimates (including real, imaginary and magnitude STFT) are stacked to form the input to this stage. Based on the number of concurrent speakers, two versions of sequential grouping are presented as follows.

1). Two-Speaker Sequential Grouping:

When there are only two concurrent speakers, a DNN can be trained to project each frame-level input to a D-dimensional embedding vector $V(t) \in \mathbb{R}^{1 \times D}$. The target label is a two-dimensional indicator vector which gives a one-hot representation of the tPIT output assignment, denoted by A(t). During the training of tPIT, if the minimum loss is achieved when $\hat{X}_1(t)$ is paired with speaker 1, and $\hat{X}_2(t)$ is paired with speaker 2, we set A(t) to [1 0]. Otherwise, A(t) is set to [0 1]. A weighted objective function between V (of size T × D, where T denotes the total number of frames) and A (of size T × 2) is defined:

$$J^{\text{DC-W}} = \left\| W \left( V V^T - A A^T \right) W \right\|_F^2$$ (3)

In the above equation, W denotes a T × T diagonal matrix whose main diagonal corresponds to a frame-level weight vector $w(t) = \frac{LD(t)}{\sum_t LD(t)}$, where LD(t) represents the frame-level loss difference (LD) between the two possible speaker assignments. $\| \cdot \|_F$ denotes the Frobenius norm.

Minimizing JDC–W forces V(t) corresponding to the same optimal assignment to get closer during training, and otherwise to become farther apart. Clustering V(t) with the K-means algorithm yields a binary label for each frame, which can be used to organize the frame-level outputs from simultaneous grouping. Deep CASA with such a sequential grouping stage is denoted by two-speaker deep CASA in this study.
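To make the weighted objective in Eq. (3) concrete, the sketch below evaluates it with NumPy. It is an illustration under stated assumptions: the function name is hypothetical, the weights w are taken as given, and the diagonal matrix W is applied by broadcasting rather than built explicitly.

```python
import numpy as np

def dc_w_objective(V, A, w):
    """Evaluate ||W (V V^T - A A^T) W||_F^2 with W = diag(w).

    V: (T, D) embeddings, A: (T, K) one-hot assignment labels,
    w: (T,) frame-level weights.
    """
    diff = V @ V.T - A @ A.T                      # (T, T) affinity mismatch
    weighted = w[:, None] * diff * w[None, :]     # equals W @ diff @ W for diagonal W
    return float((weighted ** 2).sum())
```

The objective is zero when the embedding affinities V V^T reproduce the label affinities A A^T, and frame pairs with larger weights contribute more to the penalty.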

Two-speaker sequential grouping works excellently in the case of two concurrent speakers, as there are only two possible output assignments, i.e., swap or not swap. The trained V(t) exhibits two unique patterns accordingly. However, when the number of speakers C increases, the number of possible assignments is C! = 1 × 2 × ⋯ × C, and it becomes intractable to use one vector V(t) to represent all the assignments. Even if V(t) can be trained to convey C! patterns, it is difficult to figure out the pattern-assignment pairing during inference.

2). Multi-Speaker Sequential Grouping:

To avoid the intractable embedding patterns, for a C-speaker (C ≥ 2) mixture, we use a DNN to predict C embedding vectors at each frame, $V_c(t) \in \mathbb{R}^{1 \times D}$, each corresponding to one output $\hat{X}_c(t)$ of the Dense-UNet, as shown in Fig. 2. The target label for $V_c(t)$ is a C-dimensional indicator vector, denoted by $A_c(t)$. During the training of tPIT, if the minimum loss is achieved when $\hat{X}_c(t)$ is paired with speaker c′, the c′th element of $A_c(t)$ is set to 1, and all other elements are set to 0. In other words, $A_c(t)$ indicates the optimal speaker assignment of $\hat{X}_c(t)$. Similar to two-speaker sequential grouping, a weight $w_c(t) = \frac{LD(t)}{\sum_t LD(t)}$ is used during training to emphasize frames where the speaker assignment plays an important role. Here LD(t) denotes the frame-level loss difference between the minimum and maximum loss. $w_c(t)$ can be used to construct a CT × CT diagonal weight matrix $W = \mathrm{diag}(w_c(t))$. $V_c(t)$ and $A_c(t)$ can be reshaped into a CT × D matrix V and a CT × C matrix A, respectively. The final weighted objective function between V and A is:

$$J^{\text{DC-W}} = \left\| W \left( V V^T - A A^T \right) W \right\|_F^2$$ (4)

Fig. 2.

Diagram of multi-speaker sequential grouping.

Algorithm 1: Constrained Clustering.
Input: Embedding vectors $V_c(t)$, K-means centroids $\mu_c$
Output: Frame-level labels of all outputs $\Theta(t)$ (resulting permutation)
1: for t in {1, …, T} do
2:     $\Theta(t) \leftarrow \arg\max_{\theta(t) \in P} \sum_{c=1}^{C} V_{\theta_c(t)}(t)\, \mu_c^T$
3: end for

Optimizing JDC–W forces Vc(t) corresponding to the same speaker to get closer during training, and Vc(t) corresponding to different speakers to become farther apart. The trained Vc(t) exhibits C unique patterns, each corresponding to one speaker.

During inference, the K-means algorithm is first applied to cluster $V_c(t)$ into C groups. However, if no post-processing is conducted, several embeddings at one frame may be assigned to the same speaker. We thus design a constrained clustering algorithm to force the frame-level embeddings to different labels, as given in Algorithm 1. The input to the algorithm includes C centroids calculated using the K-means algorithm. In each frame, the resulting permutation Θ(t) corresponds to the assignment that maximizes the sum of similarities between embeddings and centroids. Here P denotes the set of all permutations. After the constrained clustering algorithm, frame-level outputs are organized according to their labels, and resynthesized to the time domain. Deep CASA with multi-speaker sequential grouping is denoted by multi-speaker deep CASA in this study.
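The per-frame search in Algorithm 1 can be sketched as below. This is a hedged illustration: the function name and array layout are assumptions, the centroids are taken as already computed, and all C! permutations are enumerated exhaustively, which is feasible only for small C.

```python
import itertools
import numpy as np

def constrained_clustering(V, mu):
    """Assign each frame's C embeddings to C distinct centroids.

    V: (T, C, D) embeddings, mu: (C, D) centroids.
    Returns theta of shape (T, C); theta[t, c] is the index of the output
    whose embedding is assigned to cluster c at frame t.
    """
    _, C, _ = V.shape
    perms = np.array(list(itertools.permutations(range(C))))   # (C!, C)
    sim = V @ mu.T                          # sim[t, i, c] = V_i(t) . mu_c
    # score[t, p] = sum over c of sim[t, perms[p, c], c]
    score = sim[:, perms, np.arange(C)].sum(axis=-1)
    return perms[score.argmax(axis=1)]
```

Because every candidate is a permutation, two outputs in the same frame can never receive the same label, even when both embeddings lie closest to the same centroid.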

A temporal convolutional network (TCN) [3], [14] is used as the sequence model for both two-speaker and multi-speaker sequential grouping. In the TCN, input features are fed to 8 consecutive dilated convolutional blocks, with an exponentially increasing dilation factor. The 8 blocks are repeated 3 more times before embedding estimation. A dropDilation technique is utilized to overcome the overfitting problem during training. Other details of the TCN follow those in [18].

To build deep CASA from scratch, the simultaneous grouping and sequential grouping modules need to be trained in turn separately. We have shown in [18] that the two modules can be further fine-tuned jointly with a smaller learning rate to produce smoother source estimates. In joint optimization, the outputs of Dense-UNet are organized using the estimated clustering labels, and compared with the clean sources to form an SNR objective. In the meantime, the sequential grouping module is tuned using the same weighted objective in Eq. (4). Joint optimization is applied in this study.

III. A Causal Extension to Deep CASA

In this section, we present causal deep CASA. To turn deep CASA into a causal version, four aspects need to be examined: temporal convolution, normalization, clustering and speaker-number-independent training.

A. Temporal Convolution

Dense-UNet and TCN consist of a series of temporal convolutional layers, which are non-causal in the original deep CASA system. The left part of Fig. 3(a) illustrates a non-causal temporal convolutional layer in TCN. To generate the output of frame T, future information from frame T + 1 is used, making the layer non-causal. In the causal extension, we replace non-causal convolutions with their causal versions wherever the temporal resolution stays the same in the input and output, as shown in the right part of Fig. 3(a).
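The difference between the two layer types comes down to where the padding goes along the time axis. The following NumPy sketch, with a hypothetical function name and a single-channel signal assumed for brevity, contrasts the two cases.

```python
import numpy as np

def conv1d_time(x, kernel, dilation=1, causal=True):
    """Length-preserving 1-D convolution along time.

    x: (T,) sequence, kernel: (K,) taps. With causal=True, all padding is
    placed in the past, so output[t] depends only on x[0..t].
    """
    K = len(kernel)
    span = dilation * (K - 1)
    if causal:
        left, right = span, 0            # pad only the past
    else:
        left = span // 2
        right = span - left              # symmetric padding reads future frames
    xp = np.pad(x, (left, right))
    return np.array([
        sum(kernel[k] * xp[t + k * dilation] for k in range(K))
        for t in range(len(x))
    ])
```

Feeding a unit impulse at frame 5 through a three-tap kernel makes the contrast visible: the non-causal output is already nonzero at frame 4, while the causal output responds only from frame 5 onward.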

Fig. 3.

Temporal convolution in deep CASA. (a) A temporal convolutional layer with matched temporal resolution in the input and output. Non-causal and causal versions are illustrated on the left and right, respectively. (b). Temporal downsampling and upsampling layers in non-causal Dense-UNet.

There are two special types of temporal convolutional layers in Dense-UNet, downsampling and upsampling layers. Temporal downsampling is achieved using strided convolutional layers of size 2. Upsampling layers are transpose convolutional layers of size 2. Fig. 3(b) illustrates one pass of temporal downsampling and upsampling. During the downsampling process, inputs from every two frames are encoded into one single unit, which halves the temporal resolution. The upsampling layer then projects the encodings to the original resolution. As a result of encoding, the output at frame T – 1 requires inputs at both frame T – 1 and T, making the layers non-causal. Because the non-causality of such layers cannot be fixed, we remove all frame-wise downsampling and upsampling in Dense-UNet, but keep the frequency-wise downsampling and upsampling.

B. Normalization

Normalization is utilized extensively in deep CASA to accelerate training and stabilize neuron activations. Empirical results indicate that the choice of normalization significantly impacts the performance of speaker separation [20]. In non-causal deep CASA, standard layer normalization (LN) [2] is adopted, where the features are normalized over all but the batch dimension. Take Dense-UNet as an example. Feature maps in Dense-UNet have 4 dimensions: $z \in \mathbb{R}^{B \times T \times F \times K}$, where B, T, F, K denote the batch size, the number of frames, frequency bins, and channels, respectively. A global mean and variance are calculated for each training sample in a batch, and are then utilized to normalize the feature map:

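$$E[z] = \frac{1}{TFK} \sum_{t,f,k} z(b,t,f,k)$$ (5)
$$\mathrm{Var}[z] = \frac{1}{TFK} \sum_{t,f,k} \left( z(b,t,f,k) - E[z] \right)^2$$ (6)
$$\mathrm{LN}(z) = \frac{z - E[z]}{\sqrt{\mathrm{Var}[z] + \epsilon}} \odot \gamma + \beta$$ (7)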

where $\gamma, \beta \in \mathbb{R}^{1 \times 1 \times 1 \times K}$ are trainable gain and bias, ϵ is a small constant added to variance to avoid dividing by zero, and ⊙ denotes point-wise multiplication. The means and variances are calculated on a whole utterance in both training and inference, which makes layer normalization not applicable in a causal setup.

In this study, we explore three causal normalization techniques as substitutes for layer normalization. In standard batch normalization (BN) [8], features are normalized over all but the channel dimension during training:

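$$E[z] = \frac{1}{BTF} \sum_{b,t,f} z(b,t,f,k)$$ (8)
$$\mathrm{Var}[z] = \frac{1}{BTF} \sum_{b,t,f} \left( z(b,t,f,k) - E[z] \right)^2$$ (9)
$$\mathrm{BN}(z) = \frac{z - E[z]}{\sqrt{\mathrm{Var}[z] + \epsilon}} \odot \gamma + \beta$$ (10)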

where γ and β again denote trainable gain and bias. Mean and variance gathered in the training phase are utilized for all test utterances. Since recalculation of statistics is not needed, batch normalization is causal during inference.

Because of the complexity of Dense-UNet/TCN, a small batch size is used (4 or 8) during training. Channel-dependent mean and variance in BN may fluctuate severely across mini-batches. We propose a channel-independent version of batch normalization (ciBN) to overcome this issue. In ciBN, features are normalized over all dimensions during training:

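$$E[z] = \frac{1}{BTFK} \sum_{b,t,f,k} z(b,t,f,k)$$ (11)
$$\mathrm{Var}[z] = \frac{1}{BTFK} \sum_{b,t,f,k} \left( z(b,t,f,k) - E[z] \right)^2$$ (12)
$$\mathrm{ciBN}(z) = \frac{z - E[z]}{\sqrt{\mathrm{Var}[z] + \epsilon}} \odot \gamma + \beta$$ (13)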

Mean and variance gathered in training are used for inference.
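LN, BN and ciBN share the same normalization formula and differ only in the axes over which the statistics are pooled. The sketch below illustrates this for training-time statistics on a (B, T, F, K) feature map. The function and constant names are hypothetical, and the bookkeeping that stores statistics for inference in BN and ciBN is deliberately omitted.

```python
import numpy as np

def normalize(z, axes, gamma, beta, eps=1e-5):
    """Normalize z over the given axes, then apply gain and bias.

    z: (B, T, F, K) feature map; gamma, beta: broadcastable to z,
    e.g. shape (1, 1, 1, K).
    """
    mean = z.mean(axis=axes, keepdims=True)
    var = z.var(axis=axes, keepdims=True)
    return (z - mean) / np.sqrt(var + eps) * gamma + beta

LN_AXES = (1, 2, 3)        # all but batch: one statistic per sample
BN_AXES = (0, 1, 2)        # all but channel: one statistic per channel
CIBN_AXES = (0, 1, 2, 3)   # every dimension: one statistic per mini-batch
```

With a batch size of 4 or 8, the ciBN statistic pools K times more values than each per-channel BN statistic, which is consistent with the motivation given above for reducing fluctuation across mini-batches.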

We also consider a causal version of layer normalization (cLN), where the features are normalized in a causal fashion.

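$$E[z(t=\tau)] = \frac{1}{\tau FK} \sum_{t \le \tau, f, k} z(b,t,f,k)$$ (14)
$$\mathrm{Var}[z(t=\tau)] = \frac{1}{\tau FK} \sum_{t \le \tau, f, k} \left( z(b,t,f,k) - E[z(t=\tau)] \right)^2$$ (15)
$$\mathrm{cLN}(z(t=\tau)) = \frac{z(t=\tau) - E[z(t=\tau)]}{\sqrt{\mathrm{Var}[z(t=\tau)] + \epsilon}} \odot \gamma + \beta$$ (16)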

Here z(t = τ) denotes the τth frame of the feature map. In cLN, normalization is conducted frame by frame, with frame-dependent mean and variance calculated using the current frame and all previous frames. A similar normalization technique was used in the causal version of Conv-TasNet [20].
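The cumulative statistics in Eqs. (14)-(16) can be computed in one pass with running sums. The sketch below is an assumed NumPy formulation, not the authors' code: the function name is hypothetical, and the variance is obtained as E[z²] − E[z]², which is algebraically equal to Eq. (15).

```python
import numpy as np

def causal_layer_norm(z, gamma, beta, eps=1e-5):
    """Causal layer normalization of a (B, T, F, K) feature map.

    Statistics for frame tau pool over frames 1..tau, all frequencies and
    all channels, separately for each sample in the batch.
    """
    _, T, F, K = z.shape
    count = (np.arange(1, T + 1) * F * K).reshape(1, T, 1, 1)
    cum_sum = np.cumsum(z.sum(axis=(2, 3), keepdims=True), axis=1)
    cum_sq = np.cumsum((z ** 2).sum(axis=(2, 3), keepdims=True), axis=1)
    mean = cum_sum / count
    # Clip tiny negative values caused by floating-point cancellation.
    var = np.maximum(cum_sq / count - mean ** 2, 0.0)
    return (z - mean) / np.sqrt(var + eps) * gamma + beta
```

Because each frame's statistics ignore everything after it, perturbing later frames leaves earlier outputs unchanged, and the same computation applies in training and at test time.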

The three causal normalization techniques can also be applied to the TCN in the sequential grouping stage. All operations stay the same, but the frequency dimension is neglected.

In addition to BN, ciBN and cLN, we plan to explore multi-GPU training with data parallelism and synchronized batch normalization in future research, which can greatly increase the batch size in training.

C. Clustering

Once embedding vectors are generated, a clustering step is needed to assign them to different speakers. Most clustering based speaker separation algorithms, e.g., deep clustering and deep CASA, perform this step in an offline fashion. In deep clustering, the K-means algorithm iteratively generates centroids of clusters using all embedding vectors in the whole utterance. It is difficult to make a causal extension to K-means for deep clustering, as embedding vectors corresponding to some clusters may not be present in the beginning part of an utterance. Therefore, the number of clusters is unclear for causal processing.

On the other hand, in the setting of multi-speaker deep CASA, there are C embedding vectors in each frame, each belonging to a unique cluster. The design of causal clustering becomes much easier. The details are given in Algorithm 2.

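Algorithm 2: Causal Clustering for Multi-Speaker Deep CASA.
Input: Embedding vectors $V_c(t)$, frame-level energy of the mixture E(t), energy threshold α, maximal queue size $S_{max}$
Output: Frame-level labels of all outputs Θ(t)
for c in {1, …, C} do
    $Q_c \leftarrow$ NEW_FIFO_QUEUE()
    $Q_c$.enqueue($V_c(1)$)
    $\mu_c \leftarrow Q_c$.mean()
    $\Theta_c(1) \leftarrow c$
end for
$E_{max} \leftarrow E(1)$, $t \leftarrow 2$
while t ≤ T do
    $\Theta(t) \leftarrow \arg\max_{\theta(t) \in P} \sum_{c=1}^{C} V_{\theta_c(t)}(t)\, \mu_c^T$
    if $E(t) > \alpha E_{max}$ then
        for c in {1, …, C} do
            $Q_c$.enqueue($V_{\Theta_c(t)}(t)$)
            if $Q_c$.size() > $S_{max}$ then
                $Q_c$.dequeue()
            end if
            $\mu_c \leftarrow Q_c$.mean()
        end for
    end if
    $E_{max} \leftarrow \max(E_{max}, E(t))$
    $t \leftarrow t + 1$
end while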

At the start of the algorithm, C first-in-first-out (FIFO) queues are created to store embedding vectors belonging to the clusters. Each embedding vector in the first frame is pushed to one of the queues to form the initial data. Centroids of the clusters are calculated as mean values of the queues. Starting from frame two, each embedding vector is assigned to a unique cluster using the assignment that maximizes the sum of similarities between embeddings and centroids. If the energy of the current frame is insignificant, we move to the next frame. Otherwise, we push the embedding vectors to their corresponding queues, and update the centroids. In order to keep the centroids relatively near the current frame, we remove the oldest item in the queue when the size of the queue exceeds Smax. To decide whether a frame has significant energy, we keep track of the maximum frame energy Emax. Frames whose energy is weaker than αEmax are considered uninformative, and would not be used for centroid calculation. The frame-level assignment continues until all frames are processed. The two parameters α and Smax are set to 0.3 and 20 in our study, and the system performance is insensitive to these specific values.
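The procedure described above can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not the authors' implementation: the function name and array layout are assumptions, indices are zero-based, and the permutation search is exhaustive.

```python
from collections import deque
import itertools
import numpy as np

def causal_clustering(V, E, alpha=0.3, s_max=20):
    """Online clustering of per-frame embeddings.

    V: (T, C, D) embeddings, E: (T,) frame-level mixture energy.
    Returns theta (T, C); theta[t, c] is the index of the output assigned
    to cluster c at frame t.
    """
    T, C, _ = V.shape
    perms = list(itertools.permutations(range(C)))
    # One FIFO queue per cluster, seeded with the first frame's embeddings.
    queues = [deque([V[0, c]]) for c in range(C)]
    mu = np.stack([np.stack(list(q)).mean(axis=0) for q in queues])
    theta = np.zeros((T, C), dtype=int)
    theta[0] = np.arange(C)
    e_max = E[0]
    for t in range(1, T):
        sim = V[t] @ mu.T                     # sim[i, c] = V_i(t) . mu_c
        best = max(perms, key=lambda p: sum(sim[p[c], c] for c in range(C)))
        theta[t] = best
        if E[t] > alpha * e_max:              # update only on energetic frames
            for c in range(C):
                queues[c].append(V[t, best[c]])
                if len(queues[c]) > s_max:
                    queues[c].popleft()       # drop the oldest embedding
                mu[c] = np.stack(list(queues[c])).mean(axis=0)
        e_max = max(e_max, E[t])
    return theta
```

Each frame's label depends only on embeddings seen so far, so the labels can be emitted as frames arrive. The bounded queue keeps the centroids tied to recent frames rather than to the whole history.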

We also design a causal clustering algorithm for two-speaker deep CASA, as shown in Algorithm 3. In two-speaker deep CASA, each frame only has one embedding vector, indicating the frame-level optimal assignment. At the first frame, we create 2 FIFO queues to store embedding vectors. The first embedding vector is pushed to the first queue. Starting from frame two, if the second queue is empty, we check the similarity of embedding vectors between the current frame and the previous frame. If the similarity is lower than ρ, we set the current frame to cluster 2, and push the embedding vector to the second queue. Otherwise the current frame is set to cluster 1, and the checking continues. Once the second queue holds its first item, the algorithm follows the same process as in Algorithm 2. The energy threshold α, similarity threshold ρ, and Smax are set to 0.3, 0.5 and 10, respectively. Both Algorithms 2 and 3 are easy to implement and fast during inference.

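Algorithm 3: Causal Clustering for Two-Speaker Deep CASA.
Input: Embedding vectors V(t), frame-level energy of the mixture E(t), energy threshold α, similarity threshold ρ, maximal queue size $S_{max}$
Output: Frame-level label Θ(t)
for c in {1, 2} do
    $Q_c \leftarrow$ NEW_FIFO_QUEUE()
end for
$Q_1$.enqueue(V(1))
$\mu_1 \leftarrow V(1)$, $E_{max} \leftarrow E(1)$, $\Theta(1) \leftarrow 1$, $t \leftarrow 2$
while t ≤ T do
    if $Q_2$.empty() then
        if $V(t-1)\, V(t)^T < \rho$ then
            $\Theta(t) \leftarrow 2$
        else
            $\Theta(t) \leftarrow 1$
        end if
    else
        $\Theta(t) \leftarrow \arg\max_{c \in \{1,2\}} V(t)\, \mu_c^T$
    end if
    if $E(t) > \alpha E_{max}$ or ($Q_2$.empty() and Θ(t) == 2) then
        $Q_{\Theta(t)}$.enqueue(V(t))
        if $Q_{\Theta(t)}$.size() > $S_{max}$ then
            $Q_{\Theta(t)}$.dequeue()
        end if
        $\mu_{\Theta(t)} \leftarrow Q_{\Theta(t)}$.mean()
    end if
    $E_{max} \leftarrow \max(E_{max}, E(t))$
    $t \leftarrow t + 1$
end while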
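A Python rendering of Algorithm 3 is sketched below for illustration. The function name is hypothetical, labels are kept in {1, 2} as in the pseudocode, and ties in the centroid comparison are assumed to resolve to cluster 1.

```python
from collections import deque
import numpy as np

def causal_clustering_two_speaker(V, E, alpha=0.3, rho=0.5, s_max=10):
    """Online labelling of one embedding per frame into clusters {1, 2}.

    V: (T, D) embeddings, E: (T,) frame-level mixture energy.
    """
    T = V.shape[0]
    queues = {1: deque([V[0]]), 2: deque()}
    mu = {1: V[0].copy(), 2: None}
    labels = np.ones(T, dtype=int)
    e_max = E[0]
    for t in range(1, T):
        second_empty = len(queues[2]) == 0
        if second_empty:
            # No second centroid yet: compare with the previous frame.
            labels[t] = 2 if V[t - 1] @ V[t] < rho else 1
        else:
            labels[t] = 1 if V[t] @ mu[1] >= V[t] @ mu[2] else 2
        k = int(labels[t])
        if E[t] > alpha * e_max or (second_empty and k == 2):
            queues[k].append(V[t])
            if len(queues[k]) > s_max:
                queues[k].popleft()
            mu[k] = np.stack(list(queues[k])).mean(axis=0)
        e_max = max(e_max, E[t])
    return labels
```

The first frame assigned to cluster 2 is enqueued regardless of its energy, so the second centroid is defined as soon as a second pattern appears.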

D. Speaker-Number-Independent Training

The total number of concurrent speakers is usually unknown in real-world causal applications. A system that generalizes well to an unknown speaker number is crucial for these situations. Although multi-speaker deep CASA is designed for C concurrent speakers (C ≥ 2), if trained properly, a C-speaker system can generate good results for speech mixtures with fewer than C speakers, without prior knowledge of the speaker number. In such cases, some of the outputs produce significantly lower energy than other outputs, corresponding to silence. The details of speaker-number-independent training are presented in Section IV-C.

IV. Evaluation and Comparison

A. Experimental Setup

We evaluate our systems on two-speaker and three-speaker separation datasets, WSJ0-2mix and WSJ0-3mix [6]. Both datasets have a 30-hour training set and a 10-hour validation set generated by selecting random speakers in the Wall Street Journal (WSJ0) training set, and mixing them at various SNRs between 0 dB and 5 dB. Evaluation is conducted on the open-condition (OC) test sets, which are similarly generated using 16 untrained speakers from the WSJ0 development set. All mixtures are sampled at 8 kHz. We calculate STFT with a frame length of 32 ms, a frame shift of 8 ms, and a square root Hanning window.
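At an 8 kHz sampling rate, the stated analysis settings correspond to 256-sample frames with a 64-sample shift. The sketch below shows one way to realize them in NumPy. The constant and function names are hypothetical, and the FFT size (equal to the frame length), the symmetric Hanning window from `np.hanning`, and the absence of edge padding are all assumptions not specified in the text.

```python
import numpy as np

FS = 8000                          # sampling rate in Hz
FRAME_LEN = FS * 32 // 1000        # 32 ms -> 256 samples
FRAME_SHIFT = FS * 8 // 1000       # 8 ms  -> 64 samples
WINDOW = np.sqrt(np.hanning(FRAME_LEN))   # square-root Hanning window

def stft(x):
    """Complex STFT of a 1-D signal; returns (num_frames, FRAME_LEN // 2 + 1)."""
    n_frames = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    frames = np.stack([
        x[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN] * WINDOW
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)
```

Under these assumptions, one second of audio yields 122 frames of 129 frequency bins each.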

Performance is evaluated in terms of signal-to-distortion ratio improvement (ΔSDR) [24], perceptual evaluation of speech quality (PESQ) whose values range from −0.5 to 4.5 [9], and extended short-time objective intelligibility (ESTOI) whose values typically range between 0 and 1 [10]. Results are also reported in terms of scale-invariant signal-to-noise ratio improvement (ΔSI-SNR) [21] for a systematic comparison with other systems.
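The text does not spell out ΔSI-SNR, so the sketch below follows the commonly used definition: both signals are made zero-mean, the estimate is projected onto the reference, and the improvement is measured relative to the unprocessed mixture. The function names and the small stabilizing constant `eps` are assumptions for illustration.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between 1-D signals est and ref."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (est @ ref) / (ref @ ref + eps) * ref   # projection onto the reference
    noise = est - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))

def si_snr_improvement(est, ref, mix):
    """SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes SI-SNR from plain SNR.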

B. Models

All deep CASA systems in this study adopt the basic structure of Dense-UNet and TCN as in [18]. In Dense-UNet, the number of dense layers in a dense block is set to 5, the number of channels in each dense layer is set to 64, and all dense layers have a kernel size of 3 × 3 and a stride of 1 × 1. The middle layer in each dense block is replaced with a frequency mapping layer. The network is optimized with respect to JtPIT–SNR.

In TCN, the maximum dilation factor is set to 64. The number of bottleneck units is selected as 256. The number of units in depthwise dilated convolutional layers is set to 512. DropDilation with a keep rate of 0.7 is applied during training.

Both networks are trained with the Adam optimization algorithm [11]. The initial learning rate is set to 0.0001 for Dense-UNet, and 0.00025 for TCN. Learning rate adjustment and early stopping are employed based on the loss on the validation set.

For causal deep CASA, temporal connections, normalization and clustering algorithms are modified as described in Section III. The model looks back 72 past frames in simultaneous grouping, and 1016 past frames in sequential grouping. Thus the theoretical receptive field of causal deep CASA is 8.704 seconds, all in the past. The latency of causal deep CASA corresponds to one frame of STFT, which is 32 ms.
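As a quick check of the receptive field quoted above, assuming it is simply the sum of the two look-back spans counted at the 8 ms frame shift:

$$(72 + 1016)\ \text{frames} \times 8\ \text{ms/frame} = 8704\ \text{ms} = 8.704\ \text{s}$$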

C. Results and Comparisons

We first evaluate causal deep CASA on two-speaker mixtures. Different simultaneous grouping models are compared in Table I. Outputs are organized with the optimal speaker assignment before evaluation. The first row corresponds to Dense-UNet with non-causal connections and normalization. A modest performance drop is observed when we switch to the causal versions. Two normalization techniques are evaluated for causal processing. BN leads to negligibly better results than ciBN. Due to slow training, we did not use cLN for causal Dense-UNet, and leave it as future work.

TABLE I.

Average ΔSDR, PESQ and ESTOI for Simultaneous Grouping Models With the Optimal Output Assignment on WSJ0-2mix OC

| Simul. Group. | Temporal convolution | Normalization | ΔSDR (dB) | PESQ | ESTOI (%) |
| --- | --- | --- | --- | --- | --- |
| Dense-UNet | Non-causal | LN | 19.1 | 3.63 | 94.3 |
| Dense-UNet | Causal | BN | 18.0 | 3.52 | 93.2 |
| Dense-UNet | Causal | ciBN | 17.8 | 3.52 | 93.1 |

Table II compares different sequential grouping models for two-speaker mixtures. All sequential grouping TCNs in the table are built with causal connections and normalization, and trained on top of the causal Dense-UNet with BN. The first three rows compare three causal normalization techniques for two-speaker deep CASA. Because its statistics are calculated in the same way during training and testing, cLN substantially outperforms the other two techniques. We also train a causal TCN with cLN under the multi-speaker setup, as given in the fourth row. It performs slightly worse than the two-speaker version, reflecting the principle of Occam’s razor. When the number of concurrent speakers is fixed to 2, one embedding vector per frame is enough to indicate the optimal output assignment. The extra embedding vectors in multi-speaker deep CASA do not convey much information, and lead to worse performance during inference.

TABLE II.

Average ΔSDR, PESQ and ESTOI for Different Sequential Grouping Models on WSJ0-2mix OC

| Seq. Group. | Temporal convolution | Normalization | Clustering | ΔSDR (dB) | PESQ | ESTOI (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Two-speaker | Causal | BN | Causal | 13.9 | 3.02 | 87.0 |
| Two-speaker | Causal | ciBN | Causal | 14.6 | 3.12 | 88.5 |
| Two-speaker | Causal | cLN | Causal | 15.1 | 3.18 | 89.4 |
| Multi-speaker | Causal | cLN | Causal | 14.8 | 3.14 | 88.9 |
| Two-speaker | Causal | cLN | Offline | 15.2 | 3.19 | 89.5 |
| Multi-speaker | Causal | cLN | Offline | 14.9 | 3.16 | 89.0 |

The last two rows in Table II report the results of causal TCNs with non-causal clustering. All settings follow the third and fourth rows in Table II except for the clustering algorithms. The causal clustering algorithms yield almost the same results as non-causal offline clustering, demonstrating the effectiveness of the proposed clustering.

Next, we jointly optimize the two stages of deep CASA. The results are reported in Table III. Four deep CASA systems are evaluated, either causal or non-causal, and two-speaker or multi-speaker. Joint optimization is performed in a similar fashion as in [18]. Compared to the results in Table II, modest improvements are achieved by causal deep CASA when joint optimization is performed. There is still a small gap between two-speaker and multi-speaker deep CASA in Table III, consistent with Table II. In addition to ΔSDR, PESQ and ESTOI, frame assignment error (FAE) is reported to show the percentage of incorrectly assigned frames in terms of minimum frame-level loss, in other words, errors in speaker tracking. FAE nearly triples when we switch from non-causal deep CASA to the causal ones, which suggests that speaker tracking errors are a major reason why causal deep CASA performs worse in terms of all metrics.

TABLE III.

Average ΔSDR, PESQ, ESTOI and Frame Assignment Error (FAE) for Deep CASA With Joint Optimization on WSJ0-2mix OC

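| Deep CASA with joint optimization | Causal | ΔSDR (dB) | PESQ | ESTOI (%) | FAE (%) |
| --- | --- | --- | --- | --- | --- |
| Two-speaker | No | 18.0 | 3.51 | 93.2 | 1.22 |
| Multi-speaker | No | 17.8 | 3.50 | 93.0 | 1.45 |
| Two-speaker | Yes | 15.5 | 3.25 | 90.1 | 3.58 |
| Multi-speaker | Yes | 15.2 | 3.23 | 89.7 | 3.86 |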

To further illustrate the FAE of non-causal and causal deep CASA, we compare their separated results in Fig. 4. The first two rows show a male-male test mixture and the two target speakers. The third row shows the results of non-causal deep CASA, which makes correct assignment decisions in almost every frame, and only misses a few high frequency details. The fourth row corresponds to causal deep CASA. From 0 s to 2.5 s, and 3.3 s to 5.5 s, causal and non-causal deep CASA generate almost identical outputs. However, causal deep CASA makes successive incorrect assignments between 2.5 s and 3.3 s, due to the lack of future information and limited past information.

Fig. 4.

Speaker separation results of deep CASA in log magnitude STFT. Two jointly-optimized two-speaker models, non-causal and causal deep CASA, are compared. (a) A male-male test mixture. (b) Speaker 1 in the mixture. (c) Speaker 2 in the mixture. (d) Non-causal deep CASA’s output 1. (e) Non-causal deep CASA’s output 2. (f) Causal deep CASA’s output 1. (g) Causal deep CASA’s output 2.

Table IV compares causal deep CASA (with joint optimization) with other state-of-the-art talker-independent methods on WSJ0-2mix OC. For all methods, we list the best reported results, and leave unreported fields blank. The numbers of parameters of the different methods are estimated from their papers. The second and third rows present two non-causal methods. Conv-TasNet [20] extends uPIT to the waveform domain using a convolutional neural network. The sign prediction network [28] combines DC and uPIT, and trains a separate network for phase reconstruction. All the other systems in the table are causal. The Listen and Group system [16] estimates frame-level spectral outputs in an autoregressive fashion. It consists of two stages. In the first stage, the frame-level mixture and source estimates from the previous frame are transformed into mid-level representations. The second stage groups the mid-level representations into two sources. We present the fully causal version of Listen and Group, which has no look-ahead for phase reconstruction. Other models include causal versions of uPIT, LSTM-TasNet and Conv-TasNet. As demonstrated in the table, our causal deep CASA system outperforms all other causal methods by a large margin. It even surpasses the ideal binary mask (IBM) in terms of ΔSDR, ΔSI-SNR and ESTOI, and matches the performance of non-causal Conv-TasNet, demonstrating the power of the proposed causal extension.

TABLE IV.

Number of Parameters, Average ΔSDR, ΔSI-SNR, PESQ and ESTOI for Various State-of-the-Art Systems Evaluated on WSJ0-2mix OC

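| System | # of param. | Causal | ΔSDR (dB) | ΔSI-SNR (dB) | PESQ | ESTOI (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Mixture | - | - | 0.0 | 0.0 | 2.02 | 56.1 |
| Conv-TasNet [20] | 5.1M | No | 15.6 | 15.3 | 3.24 | - |
| Sign Prediction Net [28] | 56.6M | No | 15.4 | 15.2 | 3.45 | - |
| uPIT [13] | 46.3M | Yes | 7.0 | - | - | - |
| Conv-TasNet [20] | 5.1M | Yes | 11.0 | 10.6 | - | - |
| LSTM-TasNet [20] | 32.0M | Yes | 11.2 | 10.8 | - | - |
| Listen and Group [16] | 8.2M | Yes | 11.0 | - | - | - |
| Two-speaker deep CASA | 12.8M | Yes | 15.5 | 15.2 | 3.25 | 90.1 |
| IBM | - | - | 13.8 | 13.4 | 3.28 | 89.1 |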

Table V compares multi-speaker deep CASA (with joint optimization) with other state-of-the-art methods on the three-speaker mixtures of WSJ0-3mix OC. As shown in the upper half of the table, deep CASA produces systematically better results than other methods under the non-causal setup. When we switch to the causal setting, the performance of deep CASA drops significantly as expected, mostly due to the lack of future information for sequential grouping. Even so, the proposed causal extension keeps the assignment errors at a low level and substantially outperforms the best published causal results by Conv-TasNet [20] on WSJ0-3mix OC.

TABLE V.

Number of Parameters, Average ΔSDR, ΔSI-SNR, PESQ and ESTOI for Various State-of-the-Art Systems Evaluated on WSJ0-3mix OC

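| System | # of param. | Causal | ΔSDR (dB) | ΔSI-SNR (dB) | PESQ | ESTOI (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Mixture | - | - | 0.0 | 0.0 | 1.66 | 38.5 |
| uPIT [13] | 92.7M | No | 7.7 | - | - | - |
| Conv-TasNet [20] | 5.1M | No | 13.1 | 12.7 | 2.61 | - |
| Sign Prediction Net [28] | 56.6M | No | 12.5 | 12.1 | 2.77 | - |
| Multi-speaker deep CASA | 12.8M | No | 14.8 | 14.5 | 2.83 | 81.5 |
| Conv-TasNet [20] | 5.1M | Yes | 8.2 | 7.8 | - | - |
| Multi-speaker deep CASA | 12.8M | Yes | 10.1 | 9.8 | 2.28 | 70.6 |
| IBM | - | - | 13.6 | 13.3 | 2.86 | 82.1 |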

Although multi-speaker deep CASA is designed for C concurrent speakers, in theory, a C-speaker system can be directly applied to speech mixtures with fewer than C speakers. In Table VI, we evaluate the three-speaker deep CASA systems presented in Table V on two-speaker mixtures (WSJ0-2mix OC). The two outputs with the most energy are selected as active speakers during evaluation. As shown in Table VI, the three-speaker systems yield substantially worse results than the two-speaker systems (cf. Table III) on two-speaker test mixtures, possibly due to the mismatch between training and test. Moreover, there is significant residual energy in the discarded output of three-speaker deep CASA, i.e., −16.9 dB relative to the other two outputs.
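The output selection and the relative-energy figure described above can be sketched as follows. This is an illustration under stated assumptions: outputs are plain lists of time-domain samples, the helper names are hypothetical, and the discarded energy is compared against the summed energy of the kept outputs. The exact definition behind the −16.9 dB figure is not spelled out in the text.

```python
import math


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def select_active_outputs(outputs, num_active):
    """Return (active, discarded) output indices, keeping the highest-energy ones."""
    ranked = sorted(range(len(outputs)),
                    key=lambda i: energy(outputs[i]), reverse=True)
    return sorted(ranked[:num_active]), sorted(ranked[num_active:])


def relative_energy_db(outputs, active, discarded):
    """Energy of the discarded outputs relative to the active ones, in dB.

    Assumption: the reference is the total energy of the active outputs.
    """
    e_active = sum(energy(outputs[i]) for i in active)
    e_discarded = sum(energy(outputs[i]) for i in discarded)
    if e_active == 0.0:
        raise ValueError("active outputs carry no energy")
    if e_discarded == 0.0:
        return float("-inf")
    return 10.0 * math.log10(e_discarded / e_active)


# Toy three-output example evaluated as a two-speaker mixture.
outputs = [
    [1.0, -1.0, 1.0, -1.0],   # speaker estimate
    [0.1, 0.1, -0.1, -0.1],   # residual output that should be discarded
    [2.0, 0.0, -2.0, 0.0],    # speaker estimate
]
active, discarded = select_active_outputs(outputs, num_active=2)
print(active, discarded, round(relative_energy_db(outputs, active, discarded), 2))
```

A strongly negative value means the discarded output is nearly silent, which is the desired behavior when a C-speaker model is run on a mixture with fewer than C speakers.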

TABLE VI.

Average ΔSDR, ΔSI-SNR, PESQ and ESTOI for Multi-Speaker Deep CASA, Trained on WSJ0-3mix and Evaluated on WSJ0-2mix OC

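| System | Training set | Causal | ΔSDR (dB) | ΔSI-SNR (dB) | PESQ | ESTOI (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Multi-speaker deep CASA | WSJ0-3mix | No | 14.8 | 14.4 | 3.12 | 87.6 |
| Multi-speaker deep CASA | WSJ0-3mix | Yes | 11.4 | 10.9 | 2.76 | 82.7 |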

To make the three-speaker systems generalize to two-speaker mixtures, we fine-tune three-speaker deep CASA with mixtures from both WSJ0-2mix and WSJ0-3mix. The fine-tuning is conducted similarly to joint optimization, where the two stages are updated together with a small learning rate. To enable the training of three-speaker models on WSJ0-2mix, we extend WSJ0-2mix with a third silent channel, which contains zero energy. To avoid an infinite SNR objective for the silent channel, a time-domain l1 loss is used instead to tune the simultaneous grouping module. Table VII shows the results of three-speaker deep CASA fine-tuned on WSJ0-2mix and WSJ0-3mix, and compares it with other speaker-number-independent approaches trained on WSJ0-2mix and WSJ0-3mix. All comparison approaches are uPIT based, as deep clustering based methods do not perform well when the number of speakers is unknown. The results are reported on both WSJ0-2mix OC and WSJ0-3mix OC. The first three rows summarize speaker-number-independent training of non-causal systems, where deep CASA substantially outperforms the other two approaches on all reported metrics on both datasets. While there is no prior result on a causal algorithm for speaker-number-independent separation, speaker-number-independent causal deep CASA, as shown in the fourth row, yields satisfactory results, and even outperforms the previously published speaker-number-dependent causal methods in Tables IV and V. For two-speaker mixtures, speaker-number-independent training reduces the relative energy in the discarded output of three-speaker deep CASA to −43.5 dB, negligible for practical utility.
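The role of the silent channel and the l1 loss can be illustrated with a small sketch. It assumes signals stored as plain lists of samples and uses hypothetical helper names; it is not the training code of this study. The point is that an SNR-style objective is degenerate for an all-zero target, whereas a time-domain l1 loss stays finite.

```python
import math


def add_silent_channel(sources):
    """Append an all-zero source so a two-speaker example has three targets."""
    num_samples = len(sources[0])
    return [list(s) for s in sources] + [[0.0] * num_samples]


def snr_db(target, estimate):
    """SNR in dB; degenerate when the target has zero energy."""
    signal = sum(t * t for t in target)
    noise = sum((t - e) ** 2 for t, e in zip(target, estimate))
    if signal == 0.0:
        # 10*log10(0 / noise): the objective is unbounded for a silent target.
        return float("-inf")
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)


def l1_loss(target, estimate):
    """Mean absolute error in the time domain; finite for a silent target."""
    return sum(abs(t - e) for t, e in zip(target, estimate)) / len(target)


# Two-speaker targets extended with a silent third channel.
targets = add_silent_channel([[0.5, -0.5], [0.2, 0.1]])
silent_estimate = [0.01, -0.01]  # small residual produced by the third output

print(snr_db(targets[2], silent_estimate))   # unbounded
print(l1_loss(targets[2], silent_estimate))  # small finite value
```

Minimizing the l1 loss on the silent channel pushes the corresponding output towards zero, which is consistent with the reduced discarded-output energy reported for the fine-tuned system.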

TABLE VII.

Average ΔSDR, ΔSI-SNR, PESQ and ESTOI for Speaker-Number-Independent Systems Evaluated on WSJ0-2mix OC and WSJ0-3mix OC

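| System | Causal | WSJ0-2mix OC: ΔSDR (dB) | WSJ0-2mix OC: ΔSI-SNR (dB) | WSJ0-2mix OC: PESQ | WSJ0-2mix OC: ESTOI (%) | WSJ0-3mix OC: ΔSDR (dB) | WSJ0-3mix OC: ΔSI-SNR (dB) | WSJ0-3mix OC: PESQ | WSJ0-3mix OC: ESTOI (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uPIT [13] | No | 10.1 | - | - | - | 7.8 | - | - | - |
| OR-PIT [23] | No | 15.0 | 14.8 | 3.12 | - | 12.9 | 12.6 | 2.60 | - |
| Multi-speaker deep CASA | No | 17.6 | 17.4 | 3.40 | 92.0 | 14.8 | 14.5 | 2.77 | 81.2 |
| Multi-speaker deep CASA | Yes | 14.2 | 13.9 | 3.06 | 87.8 | 10.0 | 9.6 | 2.12 | 69.7 |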

Finally, Table VIII reports the computational costs of the neural networks in terms of real time factor (RTF), which is defined as the ratio of processing time to input signal duration. RTF is evaluated by running the neural networks for three-speaker deep CASA (implemented in TensorFlow) on a single NVIDIA V100 GPU. One hundred seconds of input mixtures are processed for evaluation. As shown in the table, although the removal of temporal downsampling layers slightly slows the inference speed of causal deep CASA, it still runs much faster than real time. Non-causal and causal DNNs are on a similar scale in terms of RTF. In addition to the neural networks, all clustering algorithms in this study have a complexity of O(T) and can run fast on CPUs with proper optimization.
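The RTF definition above is a simple ratio, sketched below with hypothetical helper names. The timing wrapper is a generic wall-clock measurement and does not reproduce the GPU setup of this study; the 8 kHz sample rate in the usage lines is an assumption for illustration only.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / input signal duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process_fn, audio, sample_rate):
    """Time one call of process_fn on `audio` and return the resulting RTF."""
    start = time.perf_counter()
    process_fn(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(audio) / sample_rate)


# An RTF of 0.0110 on 100 s of audio corresponds to about 1.1 s of processing.
print(real_time_factor(1.10, 100.0))

# Generic usage with a placeholder "model" (identity function) on 1 s of audio.
print(measure_rtf(lambda x: x, [0.0] * 8000, sample_rate=8000))
```

Read this way, the RTFs in Table VIII (0.0110 and 0.0077) correspond to roughly 1.1 s and 0.77 s of network computation for the 100 s of evaluated input.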

TABLE VIII.

Real Time Factor of Deep CASA

| System | RTF |
| --- | --- |
| Causal deep CASA | 0.0110 |
| Non-causal deep CASA | 0.0077 |

V. Conclusions

We have proposed a causal deep CASA algorithm for monaural talker-independent speaker separation. We adapt temporal connections and normalization in deep CASA, and propose two causal clustering algorithms. Experimental results on the benchmark WSJ0-2mix and WSJ0-3mix datasets show that the proposed causal algorithm outperforms all published results for causal speaker separation. In addition, speaker-number-independent training broadens the utility of causal deep CASA to a more realistic scenario where the speaker number is not given beforehand. This study represents a major step towards speaker separation in real-time applications.

Although causal deep CASA shows excellent performance on simulated datasets, it has several limitations. First, its performance degrades substantially as the number of concurrent speakers increases. It should be noted that this is a common problem in other studies [20], [23], because additional speakers increase the difficulty of both simultaneous and sequential grouping. Second, the current system assumes simultaneous speakers, and does not perform well on real conversations with varying degrees of overlapped speech. Third, in this study, causal deep CASA is only trained and evaluated on clean speaker mixtures without other kinds of interference. Recently, we have extended non-causal deep CASA to deal with room reverberation [4] and background noise [17]. We plan to extend causal deep CASA to overcome these limitations in future research.

Acknowledgments

This work was supported in part by two NIDCD grants (R01 DC012048 and R01 DC015521), and in part by the Ohio Supercomputer Center. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Huseyin Hacihabiboglu.

Contributor Information

Yuzhou Liu, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA.

DeLiang Wang, Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277 USA.

References

  • [1]. Aihara R, Hanazawa T, Okato Y, Wichern G, and Roux JL, “Teacher-student deep clustering for low-delay single channel speech separation,” in Proc. Int. Conf. Acoust., Speech, 2019, pp. 690–694.
  • [2]. Ba JL, Kiros JR, and Hinton GE, “Layer normalization,” 2016, arXiv:1607.06450.
  • [3]. Bai S, Kolter JZ, and Koltun V, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” 2018, arXiv:1803.01271.
  • [4]. Delfarah M, Liu Y, and Wang DL, “Talker-independent speaker separation in reverberant conditions,” in Proc. Int. Conf. Acoust., Speech, 2020, pp. 8723–8727.
  • [5]. Herbig R and Chalupper J, “Acceptable processing delay in digital hearing aids,” Hearing Rev., vol. 17, pp. 28–31, 2010.
  • [6]. Hershey JR, Chen Z, Roux JL, and Watanabe S, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. Int. Conf. Acoust., Speech, 2016, pp. 31–35.
  • [7]. Higuchi T, Kinoshita K, Delcroix M, Zmolíková K, and Nakatani T, “Deep clustering-based beamforming for separation with unknown number of sources,” in Proc. Interspeech, 2017, pp. 1183–1187.
  • [8]. Ioffe S and Szegedy C, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn, 2015, pp. 448–456.
  • [9]. Rix AW, Beerends JG, Hollier MP, and Hekstra AP, “Perceptual evaluation of speech quality (PESQ) A new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process, vol. 2, 2001, pp. 749–752.
  • [10]. Jensen J and Taal CH, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 24, no. 11, pp. 2009–2022, November 2016.
  • [11]. Kingma D and Ba J, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Representations, 2015.
  • [12]. Kinoshita K, Drude L, Delcroix M, and Nakatani T, “Listening to each speaker one by one with recurrent selective hearing networks,” in Proc. Int. Conf. Acoust., Speech, 2018, pp. 5064–5068.
  • [13]. Kolbæk M, Yu D, Tan ZH, and Jensen J, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 25, no. 10, pp. 1901–1913, October 2017.
  • [14]. Lea C, Vidal R, Reiter A, and Hager GD, “Temporal convolutional networks: A unified approach to action segmentation,” in Proc. Eur. Conf. Comput. Vision, 2016, pp. 47–54.
  • [15]. Li C, Zhu L, Xu S, Gao P, and Xu B, “CBLDNN-based speaker-independent speech separation via generative adversarial training,” in Proc. Int. Conf. Acoust., Speech, 2018, pp. 711–715.
  • [16]. Li Z-X, Song Y, Dai L-R, and McLoughlin I, “Listening and grouping: An online autoregressive approach for monaural speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 27, no. 4, pp. 692–703, April 2019.
  • [17]. Liu Y, Delfarah M, and Wang DL, “Deep CASA for talker-independent monaural speech separation,” in Proc. Int. Conf. Acoust., Speech, 2020, pp. 6354–6358.
  • [18]. Liu Y and Wang DL, “Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 27, no. 12, pp. 2092–2102, December 2019.
  • [19]. Luo Y, Chen Z, and Mesgarani N, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 26, no. 4, pp. 787–796, April 2018.
  • [20]. Luo Y and Mesgarani N, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 27, no. 8, pp. 1256–1266, August 2019.
  • [21]. Roux JL, Wisdom S, Erdogan H, and Hershey JR, “SDR-half-baked or well done?” in Proc. Int. Conf. Acoust., Speech, 2019, pp. 626–630.
  • [22]. Shi Z, Lin H, Liu L, Liu R, Han J, and Shi A, “Deep attention gated dilated temporal convolutional networks with intra-parallel convolutional modules for end-to-end monaural speech separation,” in Proc. Interspeech, 2019, pp. 3183–3187.
  • [23]. Takahashi N, Parthasaarathy S, Goswami N, and Mitsufuji Y, “Recursive speech separation for unknown number of speakers,” in Proc. Interspeech, 2019, pp. 1348–1352.
  • [24]. Vincent E, Gribonval R, and Févotte C, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process, vol. 14, no. 4, pp. 1462–1469, July 2006.
  • [25]. Wang DL and Brown G, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. New York, NY, USA: Wiley, 2006.
  • [26]. Wang Z-Q, Roux JL, and Hershey JR, “Alternative objective functions for deep clustering,” in Proc. Int. Conf. Acoust., Speech, 2018, pp. 686–690.
  • [27]. Wang Z-Q, Roux JL, Wang DL, and Hershey JR, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. Interspeech, 2018, pp. 2708–2712.
  • [28]. Wang Z-Q, Tan K, and Wang DL, “Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,” in Proc. Int. Conf. Acoust., Speech, 2019, pp. 71–75.
