Abstract
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
Keywords: Speech separation, speaker separation, speech enhancement, supervised speech separation, deep learning, deep neural networks, speech dereverberation, time-frequency masking, array separation, beamforming
I. INTRODUCTION
THE goal of speech separation is to separate target speech from background interference. Speech separation is a fundamental task in signal processing with a wide range of applications, including hearing prosthesis, mobile telecommunication, and robust automatic speech and speaker recognition. The human auditory system has the remarkable ability to extract one sound source from a mixture of multiple sources. In an acoustic environment like a cocktail party, we seem capable of effortlessly following one speaker in the presence of other speakers and background noises. Speech separation is commonly called the “cocktail party problem,” a term coined by Cherry in his famous 1953 paper [26].
Speech separation is a special case of sound source separation. Perceptually, source separation corresponds to auditory stream segregation, a topic of extensive research in auditory perception. The first systematic study on stream segregation was conducted by Miller and Heise [124], who noted that listeners split a signal with two alternating sine-wave tones into two streams. Bregman and his colleagues have carried out a series of studies on the subject, and in a seminal book [15] he introduced the term auditory scene analysis (ASA) to refer to the perceptual process that segregates an acoustic mixture and groups the signal originating from the same sound source. Auditory scene analysis is divided into simultaneous organization and sequential organization. Simultaneous organization (or grouping) integrates concurrent sounds, while sequential organization integrates sounds across time. With auditory patterns displayed on a time-frequency representation such as a spectrogram, the main organizational principles responsible for ASA include proximity in frequency and time, harmonicity, common amplitude and frequency modulation, onset and offset synchrony, common location, and prior knowledge (see among others [11], [15], [29], [30], [32], [163]). These grouping principles also govern speech segregation [4], [31], [49], [93], [154], [201]. From ASA studies, there seems to be a consensus that the human auditory system segregates and attends to a target sound, which can be a tone sequence, a melody, or a voice. More debatable is the role of auditory attention in stream segregation [17], [120], [148], [151]. In this overview, we use speech separation (or segregation) primarily to refer to the computational task of separating the target speech signal from a noisy mixture.
How well do we perform speech segregation? One way of quantifying speech perception performance in noise is to measure the speech reception threshold (SRT), the SNR level required for a 50% intelligibility score. Miller [123] reviewed human intelligibility scores in the presence of a variety of tones, broadband noises, and other voices. Listeners were tested for their word intelligibility scores, and the results are shown in Fig. 1. In general, tones are not as interfering as broadband noises. For example, speech is intelligible even when mixed with a complex tone glide that is 20 dB more intense (pure tones are even weaker interferers). Broadband noise is the most interfering for speech perception, and the corresponding SRT is about 2 dB. When interference consists of other voices, the SRT depends on how many interfering talkers are present. As shown in Fig. 1, the SRT is about −10 dB for a single interferer but rapidly increases to −2 dB for two interferers. The SRT stays about the same (around −1 dB) when the interference contains four or more voices. There is a whopping SRT gap of 23 dB for different kinds of interference! Furthermore, it should be noted that listeners with hearing loss show substantially higher SRTs than normal-hearing listeners, ranging from a few decibels for broadband stationary noise to as high as 10–15 dB for interfering speech [44], [127], indicating a poorer ability of speech segregation.
With speech as the most important means of human communication, the ability to separate speech from background interference is crucial, as the speech of interest, or target speech, is usually corrupted by additive noises from other sound sources and reverberation from surface reflections. Although humans perform speech separation with apparent ease, it has proven very challenging to construct an automatic system that matches the human auditory system in this basic task. In his 1957 book [27], Cherry made an observation: “No machine has yet been constructed to do just that [solving the cocktail party problem].” His conclusion, unfortunately for our field, has remained largely true for six more decades, although recent advances reviewed in this article have started to crack the problem.
Given its importance, speech separation has been extensively studied in signal processing for decades. Depending on the number of sensors or microphones, one can categorize separation methods into monaural (single-microphone) and array-based (multi-microphone) methods. Two traditional approaches for monaural separation are speech enhancement [113] and computational auditory scene analysis (CASA) [172]. Speech enhancement analyzes general statistics of speech and noise, followed by estimation of clean speech from noisy speech with a noise estimate [40], [113]. The simplest and most widely used enhancement method is spectral subtraction [13], in which the power spectrum of the estimated noise is subtracted from that of noisy speech. In order to estimate background noise, speech enhancement techniques typically assume that background noise is stationary, i.e., its spectral properties do not change over time, or at least are more stationary than those of speech. CASA is based on perceptual principles of auditory scene analysis [15] and exploits grouping cues such as pitch and onset. For example, the tandem algorithm separates voiced speech by alternating pitch estimation and pitch-based grouping [78].
An array with two or more microphones uses a different principle to achieve speech separation. Beamforming, or spatial filtering, boosts the signal that arrives from a specific direction through proper array configuration, hence attenuating interference from other directions [9], [14], [88], [164]. The simplest beamformer is a delay-and-sum technique that adds multiple microphone signals from the target direction in phase and uses phase differences to attenuate signals from other directions. The amount of noise attenuation depends on the spacing, size, and configuration of the array – generally the attenuation increases as the number of microphones and the array length increase. Obviously, spatial filtering cannot be applied when target and interfering sources are co-located or near to one another. Moreover, the utility of beamforming is much reduced in reverberant conditions, which smear the directionality of sound sources.
A more recent approach treats speech separation as a supervised learning problem. The original formulation of supervised speech separation was inspired by the concept of time-frequency (T-F) masking in CASA. As a means of separation, T-F masking applies a two-dimensional mask (weighting) to the time-frequency representation of a source mixture in order to separate the target source [117], [170], [172]. A major goal of CASA is the ideal binary mask (IBM) [76], which denotes whether the target signal dominates a T-F unit in the time-frequency representation of a mixed signal. Listening studies show that ideal binary masking dramatically improves speech intelligibility for normal-hearing (NH) and hearing-impaired (HI) listeners in noisy conditions [1], [16], [109], [173]. With the IBM as the computational goal, speech separation becomes binary classification, an elementary form of supervised learning. In this case, the IBM is used as the desired signal, or target function, during training. During testing, the learning machine aims to estimate the IBM. Although it served as the first training target in supervised speech separation, the IBM is by no means the only training target and Section III presents a list of training targets, many shown to be more effective.
Since the formulation of speech separation as classification, the data-driven approach has been extensively studied in the speech processing community. Over the last decade, supervised speech separation has substantially advanced the state-of-the-art performance by leveraging large training data and increasing computing resources [21]. Supervised separation has especially benefited from the rapid rise of deep learning – the topic of this overview. Supervised speech separation algorithms can be broadly divided into the following components: learning machines, training targets, and acoustic features. In this paper, we will first review these three components. We will then describe representative algorithms, with monaural and array-based algorithms covered in separate sections. As generalization is an issue unique to supervised speech separation, it will also be treated in this overview.
Let us clarify a few related terms used in this overview to avoid potential confusion. We refer to speech separation or segregation as the general task of separating target speech from its background interference, which may include nonspeech noise, interfering speech, or both, as well as room reverberation. Furthermore, we equate speech separation and the cocktail party problem, which goes beyond the separation of two speech utterances originally experimented with by Cherry [26]. By speech enhancement (or denoising), we mean the separation of speech and nonspeech noise. When the task is limited to separating multiple voices, we use the term speaker separation.
This overview is organized as follows. We first review the three main aspects of supervised speech separation, i.e., learning machines, training targets, and features, in Sections II, III, and IV, respectively. Section V is devoted to monaural separation algorithms, and Section VI to array-based algorithms. Section VII concludes the overview with a discussion of a few additional issues, such as what signal should be considered as the target and what a solution to the cocktail party problem may look like.
II. CLASSIFIERS AND LEARNING MACHINES
Over the past decade, DNNs have significantly elevated the performance of many supervised learning tasks, such as image classification [28], handwriting recognition [53], automatic speech recognition [73], language modeling [156], and machine translation [157]. DNNs have also advanced the performance of supervised speech separation by a large margin. This section briefly introduces the types of DNNs for supervised speech separation: feedforward multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs).
The most popular model in neural networks is an MLP, which has feedforward connections from the input layer to the output layer, layer by layer, with consecutive layers fully connected. An MLP is an extension of Rosenblatt’s perceptron [142] by introducing hidden layers between the input layer and the output layer. An MLP is trained with the classical backpropagation algorithm [143], where the network weights are adjusted to minimize the prediction error through gradient descent. The prediction error is measured by a cost (loss) function between the predicted output and the desired output, the latter provided by the user as part of supervision. For example, when an MLP is used for classification, a popular cost function is cross entropy:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{N} \sum_{c=1}^{C} I_{i,c} \log p_{i,c}$$
where i indexes an output model neuron and p_{i,c} denotes the predicted probability of neuron i belonging to class c. N and C indicate the number of output neurons and the number of classes, respectively. I_{i,c} is a binary indicator, which takes 1 if the desired class of neuron i is c and 0 otherwise. For function approximation or regression, a common cost function is mean square error (MSE):

$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
where ŷ_i and y_i are the predicted output and desired output for neuron i, respectively.
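For illustration, the following minimal NumPy sketch computes these two cost functions; the variable names mirror the definitions above, and the random inputs are placeholders rather than outputs of an actual network.

```python
import numpy as np

def cross_entropy(p, labels):
    """Cross-entropy cost over N output neurons and C classes.

    p      : (N, C) array of predicted class probabilities p_{i,c}.
    labels : (N,) array of desired class indices, used in place of the
             binary indicator I_{i,c}.
    """
    N = p.shape[0]
    eps = 1e-12                      # avoid log(0)
    return -np.sum(np.log(p[np.arange(N), labels] + eps))

def mse(y_hat, y):
    """Mean square error between predicted and desired outputs."""
    return np.mean((y_hat - y) ** 2)

# Tiny usage example with random values.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)      # 4 output neurons, 3 classes
targets = rng.integers(0, 3, size=4)
print(cross_entropy(probs, targets), mse(rng.normal(size=4), rng.normal(size=4)))
```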
The representational power of an MLP increases as the number of layers increases [142], even though, in theory, an MLP with two hidden layers can approximate any function [70]. The backpropagation algorithm is applicable to an MLP of any depth. However, a deep neural network (DNN) with many hidden layers is difficult to train from a random initialization of connection weights and biases because of the so-called vanishing gradient problem: at lower layers (near the input end), the gradients calculated from error signals backpropagated from upper layers become progressively smaller or vanish. As a result, connection weights at lower layers are not modified much, and the lower layers learn little during training. This explains why MLPs with a single hidden layer were the most widely used neural networks prior to the advent of DNNs.
A breakthrough in DNN training was made by Hinton et al. [74]. The key idea is to perform layerwise unsupervised pretraining with unlabeled data to properly initialize a DNN before supervised learning (or fine tuning) is performed with labeled data. More specifically, Hinton et al. [74] proposed restricted Boltzmann machines (RBMs) to pretrain a DNN layer by layer, and RBM pretraining is found to improve subsequent supervised learning. A later remedy was to use the rectified linear unit (ReLU) [128] to replace the traditional sigmoid activation function, which converts a weighted sum of the inputs to a model neuron into the neuron’s output. Recent practice shows that a moderately deep MLP with ReLUs can be effectively trained with large training data without unsupervised pretraining. Recently, skip connections have been introduced to facilitate the training of very deep MLPs [62], [153].
A class of feedforward networks, known as convolutional neural networks (CNNs) [10], [106], has been demonstrated to be well suited for pattern recognition, particularly in the visual domain. CNNs incorporate well-documented invariances in pattern recognition such as translation (shift) invariance. A typical CNN architecture is a cascade of pairs of a convolutional layer and a subsampling layer. A convolutional layer consists of multiple feature maps, each of which learns to extract a local feature regardless of its position in the previous layer through weight sharing: the neurons within the same feature map are constrained to have the same connection weights despite their different receptive fields. A receptive field of a neuron in this context denotes the local area of the previous layer that is connected to the neuron, whose weighted-sum operation is akin to a convolution. Each convolutional layer is followed by a subsampling layer that performs local averaging or maximization over the receptive fields of the neurons in the convolutional layer. Subsampling serves to reduce resolution and sensitivity to local variations. The use of weight sharing in a CNN also has the benefit of cutting down the number of trainable parameters. Because a CNN incorporates domain knowledge in pattern recognition via its network structure, it can be better trained by the backpropagation algorithm despite the fact that a CNN is a deep network.
RNNs allow recurrent (feedback) connections, typically between hidden units. Unlike feedforward networks, which process each input sample independently, RNNs treat input samples as a sequence and model the changes over time. A speech signal exhibits strong temporal structure, and the signal within the current frame is influenced by the signals in the previous frames. Therefore, RNNs are a natural choice for learning the temporal dynamics of speech. We note that an RNN, through its recurrent connections, introduces the time dimension, which is flexible and infinitely extensible, a characteristic not shared by feedforward networks no matter how deep they are [169]; in a way, an RNN can be viewed as a DNN with infinite depth [146]. The recurrent connections are typically trained with backpropagation through time [187]. However, such RNN training is susceptible to the vanishing or exploding gradient problem [137]. To alleviate this problem, an RNN with long short-term memory (LSTM) introduces memory cells with gates to facilitate the information flow over time [75]. Specifically, a memory cell has three gates: an input gate, a forget gate, and an output gate. The forget gate controls how much previous information should be retained, and the input gate controls how much current information should be added to the memory cell. With these gating functions, LSTM allows relevant contextual information to be maintained in memory cells to improve RNN training.
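To make the gating explicit, here is a minimal NumPy sketch of a single LSTM step; the weight shapes and the sigmoid/tanh nonlinearities follow the standard LSTM formulation rather than any particular separation system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with input, forget, and output gates.

    x: input vector; h_prev, c_prev: previous hidden and cell states.
    W, U, b: dicts of input weights, recurrent weights, and biases for
    the gates 'i', 'f', 'o' and the candidate cell input 'g'.
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate memory
    c = f * c_prev + i * g          # retain old content, add new content
    h = o * np.tanh(c)              # expose gated memory as the output
    return h, c

# Tiny example: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(1)
dim_x, dim_h = 4, 3
W = {k: rng.normal(size=(dim_h, dim_x)) for k in 'ifog'}
U = {k: rng.normal(size=(dim_h, dim_h)) for k in 'ifog'}
b = {k: np.zeros(dim_h) for k in 'ifog'}
h, c = lstm_step(rng.normal(size=dim_x), np.zeros(dim_h), np.zeros(dim_h), W, U, b)
```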
Generative adversarial networks (GANs) were recently introduced with two simultaneously trained models: a generative model G and a discriminative model D [52]. The generator G learns to model labeled data, e.g., the mapping from noisy speech samples to their clean counterparts, while the discriminator – usually a binary classifier – learns to discriminate between generated samples and target samples from the training data. This framework is analogous to a two-player adversarial game, where minimax is a proven strategy [144]. During training, G aims to learn an accurate mapping so that the generated data imitate the real data well enough to fool D; on the other hand, D learns to better tell the difference between the real data and the synthetic data generated by G. Competition in this game, or adversarial learning, drives both models to improve their accuracy until generated samples are indistinguishable from real ones. The key idea of GANs is to use the discriminator to shape the loss function of the generator. GANs have recently been used in speech enhancement (see Section V.A).
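The adversarial training procedure can be summarized by the following PyTorch-style sketch; the two small fully connected networks and the random tensors standing in for noisy and clean features are illustrative assumptions, not the SEGAN system discussed in Section V.A.

```python
import torch
import torch.nn as nn

# Toy dimensions; in speech enhancement, x would be a noisy feature frame
# and y its clean counterpart (shapes here are illustrative assumptions).
dim = 64
G = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))
D = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

for step in range(100):                       # toy training loop
    x = torch.randn(32, dim)                  # stand-in for noisy features
    y = torch.randn(32, dim)                  # stand-in for clean features
    # --- Discriminator update: real (clean) -> 1, generated -> 0 ---
    opt_d.zero_grad()
    d_loss = bce(D(y), torch.ones(32, 1)) + bce(D(G(x).detach()), torch.zeros(32, 1))
    d_loss.backward()
    opt_d.step()
    # --- Generator update: fool D into labeling generated speech as real ---
    opt_g.zero_grad()
    g_loss = bce(D(G(x)), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```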
In this overview, a DNN refers to any neural network with at least two hidden layers [10], [73], in contrast to popular learning machines with just one hidden layer such as commonly used MLPs, support vector machines (SVMs) with kernels, and Gaussian mixture models (GMMs). As DNNs get deeper in practice, with more than 100 hidden layers actually used, the depth required for a neural network to be considered a DNN can be a matter of a qualitative, rather than quantitative, distinction. Also, we use the term DNN to denote any neural network with a deep structure, whether it is feedforward or recurrent.
We should mention that DNN is not the only kind of learning machine that has been employed for speech separation. Alternative learning machines used for supervised speech separation include GMM [97], [147], SVM [55], and neural networks with just one hidden layer [91]. Such studies will not be further discussed in this overview as its theme is DNN based speech separation.
III. TRAINING TARGETS
In supervised speech separation, defining a proper training target is important for learning and generalization. There are mainly two groups of training targets, i.e., masking-based targets and mapping-based targets. Masking-based targets describe the time-frequency relationships of clean speech to background interference, while mapping-based targets correspond to the spectral representations of clean speech. In this section, we survey a number of training targets proposed in the field.
Before reviewing training targets, let us first describe evaluation metrics commonly used in speech separation. A variety of metrics has been proposed in the literature, depending on the objectives of individual studies. These metrics can be divided into two classes: signal-level and perception-level. At the signal level, metrics aim to quantify the degrees of signal enhancement or interference reduction. In addition to the traditional SNR, speech distortion (loss) and noise residue in a separated signal can be individually measured [77], [113]. A prominent set of evaluation metrics comprises SDR (source-to-distortion ratio), SIR (source-to-interference ratio), and SAR (source-to-artifact ratio) [165].
As the output of a speech separation system is often consumed by the human listener, much effort has been made to quantitatively predict how the listener perceives a separated signal. Because intelligibility and quality are two primary but different aspects of speech perception, objective metrics have been developed to separately evaluate speech intelligibility and speech quality. With the IBM’s ability to elevate human speech intelligibility and its connection to the articulation index (AI) [114] – the classic model of speech perception – the HIT−FA rate has been suggested as an evaluation metric with the IBM as the reference [97]. HIT denotes the percentage of speech-dominant T-F units in the IBM that are correctly classified, and FA (false alarm) refers to the percentage of noise-dominant units that are incorrectly classified. The HIT−FA rate is found to be well correlated with speech intelligibility [97]. In recent years, the most commonly used intelligibility metric is STOI (short-time objective intelligibility), which measures the correlation between the short-time temporal envelopes of a reference (clean) utterance and a separated utterance [89], [158]. The value range of STOI is typically between 0 and 1, which can be interpreted as percent correct. Although STOI tends to overpredict intelligibility scores [64], [102], no alternative metric has been shown to consistently correlate with human intelligibility better. For speech quality, PESQ (perceptual evaluation of speech quality) is the standard metric [140] recommended by the International Telecommunication Union (ITU) [87]. PESQ applies an auditory transform to produce a loudness spectrum, and compares the loudness spectra of the clean reference signal and the separated signal to produce a score in the range of −0.5 to 4.5, corresponding to a prediction of the perceptual MOS (mean opinion score).
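As a concrete example of a mask-based metric, the sketch below computes the HIT−FA rate from a reference IBM and an estimated binary mask (the random masks are placeholders; STOI and PESQ rely on standardized reference implementations and are not reproduced here).

```python
import numpy as np

def hit_fa(ibm, estimated_mask):
    """HIT - FA given a reference IBM and an estimated binary mask.

    HIT: fraction of speech-dominant (1) units correctly labeled 1.
    FA : fraction of noise-dominant  (0) units incorrectly labeled 1.
    """
    ibm = ibm.astype(bool)
    est = estimated_mask.astype(bool)
    hit = np.mean(est[ibm]) if ibm.any() else 0.0
    fa = np.mean(est[~ibm]) if (~ibm).any() else 0.0
    return hit - fa, hit, fa

# Example with random masks of shape (time, frequency).
rng = np.random.default_rng(2)
ref = rng.integers(0, 2, size=(100, 64))
est = rng.integers(0, 2, size=(100, 64))
print(hit_fa(ref, est))
```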
A. Ideal Binary Mask
The first training target used in supervised speech separation is the ideal binary mask [76], [77], [141], [168], which is inspired by the auditory masking phenomenon in audition [126] and the exclusive allocation principle in auditory scene analysis [15]. The IBM is defined on a two-dimensional T-F representation of a noisy signal, such as a cochleagram or a spectrogram:
$$\mathrm{IBM}(t, f) = \begin{cases} 1, & \text{if } \mathrm{SNR}(t, f) > LC \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
where t and f denote time and frequency, respectively. The IBM assigns the value 1 to a unit if the SNR within the T-F unit exceeds the local criterion (LC) or threshold, and 0 otherwise. Fig. 2(a) shows an example of the IBM, which is defined on a 64-channel cochleagram. As mentioned in Section I, IBM masking dramatically increases speech intelligibility in noise for normal-hearing and hearing-impaired listeners. The IBM labels every T-F unit as either target-dominant or interference-dominant. As a result, IBM estimation can naturally be treated as a supervised classification problem. A commonly used cost function for IBM estimation is cross entropy, as described in Section II.
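A minimal sketch of IBM construction follows, assuming access to the premixed speech and noise energies in each T-F unit; the −5 dB local criterion is one commonly used choice and should be treated as an adjustable parameter.

```python
import numpy as np

def ideal_binary_mask(speech_energy, noise_energy, lc_db=-5.0):
    """Compute the IBM of (1) from T-F energies of premixed speech and noise.

    speech_energy, noise_energy: (time, frequency) arrays of unit energies
    from a cochleagram or spectrogram. lc_db is the local criterion (LC).
    """
    eps = 1e-12
    local_snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Example on random energies (placeholders for real T-F decompositions).
rng = np.random.default_rng(3)
ibm = ideal_binary_mask(rng.random((100, 64)), rng.random((100, 64)))
```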
B. Target Binary Mask
Like the IBM, the target binary mask (TBM) categorizes all T-F units with a binary label. Different from the IBM, the TBM derives the label by comparing the target speech energy in each T-F unit with a fixed interference: speech-shaped noise, which is a stationary signal corresponding to the average of all speech signals. An example of the TBM is shown in Fig. 2(b). Target binary masking also leads to dramatic improvement of speech intelligibility in noise [99], and the TBM has been used as a training target [51], [112].
C. Ideal Ratio Mask
Instead of a hard label on each T-F unit, the ideal ratio mask (IRM) can be viewed as a soft version of the IBM [84], [130], [152], [178]:
$$\mathrm{IRM}(t, f) = \left( \frac{S(t, f)^2}{S(t, f)^2 + N(t, f)^2} \right)^{\beta} \qquad (2)$$
where S(t, f)^2 and N(t, f)^2 denote the speech energy and noise energy within a T-F unit, respectively. The tunable parameter β scales the mask and is commonly chosen to be 0.5. With the square root, the IRM preserves the speech energy within each T-F unit, under the assumption that S(t, f) and N(t, f) are uncorrelated. This assumption holds well for additive noise, but not for convolutive interference as in the case of room reverberation (late reverberation, however, can be reasonably considered as uncorrelated interference). Without the square root, the IRM in (2) is similar to the classical Wiener filter, which is the optimal estimator of target speech in the power spectrum. MSE is typically used as the cost function for IRM estimation. An example of the IRM is shown in Fig. 2(c).
D. Spectral Magnitude Mask
The spectral magnitude mask (SMM) (called FFT-MASK in [178]) is defined on the STFT (short-time Fourier transform) magnitudes of clean speech and noisy speech:
$$\mathrm{SMM}(t, f) = \frac{|S(t, f)|}{|Y(t, f)|} \qquad (3)$$
where |S(t, f)| and |Y(t, f)| represent the spectral magnitudes of clean speech and noisy speech, respectively. Unlike the IRM, the SMM is not upper-bounded by 1. To obtain separated speech, we apply the SMM or its estimate to the spectral magnitudes of noisy speech, and resynthesize separated speech with the phases of noisy speech (or an estimate of the clean speech phases). Fig. 2(e) illustrates the SMM.
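The following sketch illustrates the masking-and-resynthesis pipeline with the oracle IRM and SMM; white noise stands in for real recordings, and the STFT parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(4)
clean = rng.normal(size=16000)            # placeholder for a clean utterance
noise = rng.normal(size=16000)            # placeholder for a noise recording
noisy = clean + noise

_, _, S = stft(clean, fs=16000, nperseg=512)   # STFT of clean speech
_, _, N = stft(noise, fs=16000, nperseg=512)   # STFT of noise
_, _, Y = stft(noisy, fs=16000, nperseg=512)   # STFT of the mixture

eps = 1e-12
irm = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + eps))  # Eq. (2), beta = 0.5
smm = np.abs(S) / (np.abs(Y) + eps)                                      # Eq. (3)

# Apply the SMM to the noisy magnitudes and resynthesize with the noisy phase.
separated_spec = smm * np.abs(Y) * np.exp(1j * np.angle(Y))
_, separated = istft(separated_spec, fs=16000, nperseg=512)
```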
E. Phase-Sensitive Mask
The phase-sensitive mask (PSM) extends the SMM by including a measure of phase [41]:
$$\mathrm{PSM}(t, f) = \frac{|S(t, f)|}{|Y(t, f)|} \cos\theta \qquad (4)$$
where θ denotes the difference of the clean speech phase and the noisy speech phase within the T-F unit. The inclusion of the phase difference in the PSM leads to a higher SNR, and tends to yield a better estimate of clean speech than the SMM [41]. An example of the PSM is shown in Fig. 2(f).
F. Complex Ideal Ratio Mask
The complex ideal ratio mask (cIRM) is an ideal mask in the complex domain. Unlike the aforementioned masks, it can perfectly reconstruct clean speech from noisy speech [188]:
$$S(t, f) = \mathrm{cIRM}(t, f) * Y(t, f) \qquad (5)$$
where S, Y denote the STFT of clean speech and noisy speech, respectively, and ‘*’ represents complex multiplication. Solving for mask components results in the following definition:
$$\mathrm{cIRM}(t, f) = \frac{Y_r S_r + Y_i S_i}{Y_r^2 + Y_i^2} + i\,\frac{Y_r S_i - Y_i S_r}{Y_r^2 + Y_i^2} \qquad (6)$$
where Y_r and Y_i denote the real and imaginary components of noisy speech, respectively, and S_r and S_i the real and imaginary components of clean speech, respectively. The imaginary unit is denoted by ‘i’. Thus the cIRM has a real component and an imaginary component, which can be separately estimated in the real domain. Because of complex-domain calculations, the mask values become unbounded, so some form of compression should be used to bound them, such as a hyperbolic tangent or sigmoidal function [184], [188].
Williamson et al. [188] observe that, in Cartesian coordinates, structure exists in both real and imaginary components of the cIRM, whereas in polar coordinates, structure exists in the magnitude spectrogram but not phase spectrogram. Without clear structure, direct phase estimation would be intractable through supervised learning, although we should mention a recent paper that uses complex-domain DNN to estimate complex STFT coefficients [107]. On the other hand, an estimate of the cIRM provides a phase estimate, a property not possessed by PSM estimation.
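A minimal sketch of cIRM computation following (6) is given below, with a hyperbolic-tangent-style compression to bound the mask values; the compression constants K and C are illustrative assumptions and vary across studies.

```python
import numpy as np

def cirm(S, Y, K=10.0, C=0.1):
    """Real and imaginary cIRM components of (6), with tanh-style compression.

    S, Y: complex STFTs of clean and noisy speech. K and C control the
    compression range and slope (values here are illustrative).
    """
    eps = 1e-12
    denom = Y.real ** 2 + Y.imag ** 2 + eps
    m_r = (Y.real * S.real + Y.imag * S.imag) / denom
    m_i = (Y.real * S.imag - Y.imag * S.real) / denom
    # Bound the otherwise unbounded mask values with a hyperbolic tangent.
    compress = lambda m: K * (1 - np.exp(-C * m)) / (1 + np.exp(-C * m))
    return compress(m_r), compress(m_i)

# Example on random complex spectra standing in for real STFTs.
rng = np.random.default_rng(5)
S = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
Y = S + rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
mr, mi = cirm(S, Y)
```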
G. Target Magnitude Spectrum
The target magnitude spectrum (TMS) of clean speech, or |S(t, f)|, is a mapping-based training target [57], [116], [196],[197]. In this case supervised learning aims to estimate the magnitude spectrogram of clean speech from that of noisy speech. Power spectrum, or other forms of spectra such as mel spectrum, may be used instead of magnitude spectrum, and a log operation is usually applied to compress the dynamic range and facilitate training. A prominent form of the TMS is the log-power spectrum normalized to zero mean and unit variance [197]. An estimated speech magnitude is then combined with noisy phase to produce the separated speech waveform. In terms of cost function, MSE is usually used for TMS estimation. Alternatively, maximum likelihood can be employed to train a TMS estimator that explicitly models output correlation [175]. Fig. 2(g) shows an example of the TMS.
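A short sketch of TMS target preparation follows, assuming per-frequency-channel mean and variance normalization of the log-power spectrum (global normalization is also used in practice); the normalization statistics are kept so that an estimated spectrum can be denormalized before waveform resynthesis.

```python
import numpy as np
from scipy.signal import stft

def tms_target(clean, fs=16000, nperseg=512, eps=1e-12):
    """Mapping-based target: normalized log-power spectrum of clean speech."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    log_power = np.log(np.abs(S) ** 2 + eps)
    mean = log_power.mean(axis=1, keepdims=True)     # per-channel statistics
    std = log_power.std(axis=1, keepdims=True) + eps
    return (log_power - mean) / std, mean, std       # keep stats to undo normalization

# Example on a random signal standing in for a clean utterance.
target, mu, sigma = tms_target(np.random.default_rng(6).normal(size=16000))
```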
H. Gammatone Frequency Target Power Spectrum
Another closely related mapping-based target is the gammatone frequency target power spectrum (GF-TPS) [178]. Unlike the TMS defined on a spectrogram, this target is defined on a cochleagram based on a gammatone filterbank. Specifically, this target is defined as the power of the cochleagram response to clean speech. An estimate of the GF-TPS is easily converted to the separated speech waveform through cochleagram inversion [172]. Fig. 2(d) illustrates this target.
I. Signal Approximation
The idea of signal approximation (SA) is to train a ratio mask estimator that minimizes the difference between the spectral magnitude of clean speech and that of estimated speech [81], [186]:
$$\mathrm{SA}(t, f) = \big[\mathrm{RM}(t, f)\,|Y(t, f)| - |S(t, f)|\big]^2 \qquad (7)$$
RM(t, f) refers to an estimate of the SMM. So, SA can be interpreted as a target that combines ratio masking and spectral mapping, seeking to maximize SNR [186]. A related, earlier target aims for the maximal SNR in the context of IBM estimation [91]. For the SA target, better separation performance is achieved with two-stage training [186]. In the first stage, a learning machine is trained with the SMM as the target. In the second stage, the learning machine is fine-tuned by minimizing the loss function of (7).
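The SA cost of (7) reduces to a few lines, as sketched below; in practice RM(t, f) is produced by the learning machine rather than supplied directly.

```python
import numpy as np

def signal_approximation_loss(rm_est, noisy_mag, clean_mag):
    """Signal approximation cost of (7): squared error between the masked
    noisy magnitudes and the clean magnitudes, averaged over T-F units.

    rm_est    : estimated ratio mask RM(t, f) from the learning machine.
    noisy_mag : |Y(t, f)|; clean_mag : |S(t, f)|.
    """
    return np.mean((rm_est * noisy_mag - clean_mag) ** 2)
```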
A number of training targets have been compared using a fixed feedforward DNN with three hidden layers and the same set of input features [178]. The separated speech using various training targets is evaluated in terms of STOI and PESQ, for predicted speech intelligibility and speech quality, respectively. In addition, a representative speech enhancement algorithm [66] and a supervised nonnegative matrix factorization (NMF) algorithm [166] are evaluated as benchmarks. The evaluation results are given in Fig. 3. A number of conclusions can be drawn from this study. First, in terms of objective intelligibility, the masking-based targets as a group outperform the mapping-based targets, although a recent study [155] indicates that masking is advantageous only at higher input SNRs and at lower SNRs mapping is more advantageous. In terms of speech quality, ratio masking performs better than binary masking. Particularly illuminating is the contrast between the SMM and the TMS, which are the same except for the use of |Y(t, f)| in the denominator of the SMM (see (3)). The better estimation of the SMM may be attributed to the fact that the target magnitude spectrum is insensitive to the interference signal and SNR, whereas the SMM is not. The many-to-one mapping in the TMS makes its estimation potentially more difficult than SMM estimation. In addition, the estimation of unbounded spectral magnitudes tends to magnify estimation errors [178]. Overall, the IRM and the SMM emerge as the preferred targets. In addition, DNN based ratio masking performs substantially better than supervised NMF and unsupervised speech enhancement.
The above list of training targets is not meant to be exhaustive, and other targets have been used in the literature. Perhaps the most straightforward target is the waveform (time-domain) signal of clean speech. This indeed was used in an early study that trains an MLP to map from a frame of noisy speech waveform to a frame of clean speech waveform, which may be called temporal mapping [160]. Although simple, such direct mapping does not perform well even when a DNN is used in place of a shallow network [34], [182]. In [182], a target is defined in the time domain, but the DNN for target estimation includes modules for ratio masking and inverse Fourier transform with noisy phase. This target is closely related to the PSM. A recent study evaluates oracle results of a number of ideal masks and additionally introduces the so-called ideal gain mask (IGM) [184], defined in terms of the a priori SNR and a posteriori SNR commonly used in traditional speech enhancement [113]. In [192], the so-called optimal ratio mask, which takes into account the correlation between target speech and background noise [110], was evaluated and found to be an effective target for DNN-based speech separation.
IV. FEATURES
Features as input and learning machines play complementary roles in supervised learning. When features are discriminative, they place less demand on the learning machine in order to perform a task successfully. On the other hand, a powerful learning machine places less demand on features. At one extreme, a linear classifier, like Rosenblatt’s perceptron, is all that is needed when features make a classification task linearly separable. At the other extreme, the input in the original form without any feature extraction (e.g., waveform in audio) suffices if the classifier is capable of learning appropriate features. In between are a majority of tasks where both feature extraction and learning are important.
Early studies in supervised speech separation use only a few features such as interaural time differences (ITD) and interaural level (intensity) differences (IID) [141] in binaural separation, and pitch-based features [55], [78], [91] and the amplitude modulation spectrogram (AMS) [97] in monaural separation. A subsequent study [177] explores more monaural features, including mel-frequency cepstral coefficient (MFCC), gammatone frequency cepstral coefficient (GFCC) [150], perceptual linear prediction (PLP) [67], and relative spectral transform PLP (RASTA-PLP) [68]. Through feature selection using group Lasso, the study recommends a complementary feature set comprising AMS, RASTA-PLP, and MFCC (and pitch if it can be reliably estimated), which has since been used in many studies.
We conducted a study to examine an extensive list of acoustic features for supervised speech separation at low SNRs [22]. The features have been previously used for robust automatic speech recognition and classification-based speech separation. The feature list includes mel-domain, linear-prediction, gammatone-domain, zero-crossing, autocorrelation, medium-time-filtering, modulation, and pitch-based features. The mel-domain features are MFCC and delta-spectral cepstral coefficient (DSCC) [104], which is similar to MFCC except that a delta operation is applied to the mel-spectrum. The linear prediction features are PLP and RASTA-PLP. The three gammatone-domain features are gammatone feature (GF), GFCC, and gammatone frequency modulation coefficient (GFMC) [119]. GF is computed by passing an input signal to a gammatone filterbank and applying a decimation operation to subband signals. A zero-crossing feature, called zero-crossings with peak-amplitudes (ZCPA) [96], computes zero-crossing intervals and corresponding peak amplitudes from subband signals derived using a gammatone filterbank. The autocorrelation features are relative autocorrelation sequence MFCC (RAS-MFCC) [204], autocorrelation sequence MFCC (AC-MFCC) [149], and phase autocorrelation MFCC (PAC-MFCC) [86], all of which apply the MFCC procedure in the autocorrelation domain. The medium-time filtering features are power normalized cepstral coefficients (PNCC) [95] and suppression of slowly-varying components and the falling edge of the power envelope (SSF) [94]. The modulation domain features are Gabor filterbank (GFB) [145] and AMS features. Pitch-based (PITCH) features calculate T-F level features based on pitch tracking and use periodicity and instantaneous frequency to discriminate speech-dominant T-F units from noise-dominant ones. In addition to existing features, we proposed a new feature called Multi-Resolution Cochleagram (MRCG) [22], which computes four cochleagrams at different spectrotemporal resolutions to provide both local information and a broader context.
The features are post-processed with an auto-regressive moving average (ARMA) filter [19] and evaluated with a fixed MLP based IBM estimator. The estimated masks are evaluated in terms of classification accuracy and the HIT−FA rate. The HIT−FA results are shown in Table I. As shown in the table, gammatone-domain features (MRCG, GF, and GFCC) consistently outperform the other features in both accuracy and HIT−FA rate, with MRCG performing the best. Cepstral compaction via the discrete cosine transform (DCT) is not effective, as revealed by comparing the GF and GFCC features. Neither is modulation extraction, as shown by comparing GFCC and GFMC, the latter calculated from the former. It is worth noting that the poor performance of pitch features is largely due to inaccurate pitch estimation at low SNRs, as ground-truth pitch is shown to be quite discriminative.
TABLE I. HIT−FA Results for the Evaluated Features Across Six Noise Types
Feature | Factory | Babble | Engine | Cockpit | Vehicle | Tank | Average |
---|---|---|---|---|---|---|---|
MRCG | 63 (7) | 49 (13) | 77 (4) | 73 (4) | 80 (10) | 77 (6) | 70 (7) |
GF | 61 (7) | 45 (15) | 75 (4) | 71 (3) | 80 (10) | 76 (6) | 68 (8) |
GFCC | 61 (6) | 46 (14) | 73 (4) | 70 (3) | 78 (11) | 74 (6) | 67 (7) |
DSCC | 56 (7) | 42 (14) | 70 (5) | 66 (3) | 77 (11) | 73 (6) | 64 (8) |
MFCC | 57 (7) | 43 (14) | 69 (5) | 67 (4) | 77 (11) | 72 (7) | 64 (8) |
PNCC | 56 (6) | 44 (14) | 69 (5) | 66 (4) | 77 (11) | 71 (7) | 64 (8) |
PLP | 56 (6) | 41 (12) | 68 (5) | 66 (4) | 77 (11) | 71 (7) | 63 (8) |
AC-MFCC | 56 (6) | 42 (14) | 67 (5) | 65 (4) | 77 (11) | 71 (7) | 63 (8) |
RAS-MFCC | 57 (6) | 41 (14) | 68 (5) | 66 (4) | 76 (11) | 71 (7) | 63 (8) |
GFB | 57 (7) | 41 (18) | 67 (5) | 66 (4) | 75 (12) | 70 (7) | 63 (9) |
ZCPA | 55 (8) | 40 (16) | 68 (5) | 65 (4) | 75 (13) | 70 (8) | 62 (9) |
SSF | 54 (7) | 39 (15) | 67 (5) | 60 (4) | 76 (11) | 69 (7) | 61 (8) |
RASTA-PLP | 52 (6) | 38 (15) | 64 (5) | 61 (4) | 76 (12) | 67 (7) | 60 (8) |
GFMC | 48 (7) | 35 (15) | 61 (6) | 60 (5) | 67 (17) | 59 (9) | 55 (10) |
PITCH | 46 (3) | 29 (22) | 50 (5) | 50 (2) | 59 (16) | 53 (7) | 48 (9) |
AMS | 40 (6) | 27 (9) | 49 (5) | 52 (4) | 50 (31) | 45 (11) | 44 (11) |
PAC-MFCC | 17 (5) | 11 (8) | 30 (9) | 29 (7) | 40 (48) | 21 (17) | 25 (16) |
Boldface Indicates Best Scores
Recently, Delfarah and Wang [34] performed another feature study that considers room reverberation, and both speech denoising and speaker separation. Their study uses a fixed DNN trained to estimate the IRM, and the evaluation results are given in terms of STOI improvements over unprocessed noisy and reverberant speech. The features added in this study include the log spectral magnitude (LOG-MAG) and the log mel-spectrum (LOG-MEL), both of which are commonly used in supervised separation [82], [196]. Also included is the waveform signal (WAV) without any feature extraction. For reverberation, simulated room impulse responses (RIRs) and recorded RIRs are both used, with reverberation time up to 0.9 seconds. For denoising, evaluation is done separately for matched noises, where the first half of each nonstationary noise is used in training and the second half for testing, and unmatched noises, where completely new noises are used for testing. For cochannel (two-speaker) separation, the target talker is male while the interfering talker is either female or male. Table II shows the STOI gains for the individual features evaluated. In the anechoic, matched noise case, STOI results are largely consistent with Table I. Feature results are also broadly consistent between simulated and recorded RIRs. However, the best performing features are different for the matched noise, unmatched noise, and speaker separation cases. Besides MRCG, PNCC and GFCC produce the best results for the unmatched noise and cochannel conditions, respectively. For feature combination, this study concludes that the most effective feature set consists of PNCC, GF, and LOG-MEL for speech enhancement, and PNCC, GFCC, and LOG-MEL for speaker separation.
TABLE II. STOI Gains Over Unprocessed Speech for the Evaluated Features
Feature | Matched noise: Anechoic | Matched noise: Sim. RIRs | Matched noise: Rec. RIRs | Unmatched noise: Anechoic | Unmatched noise: Sim. RIRs | Unmatched noise: Rec. RIRs | Cochannel: Anechoic | Cochannel: Sim. RIRs | Cochannel: Rec. RIRs | Average |
---|---|---|---|---|---|---|---|---|---|---|
MRCG | 7.12 | 14.25 | 12.15 | 7.00 | 7.28 | 8.99 | 21.25 (13.00) | 22.93 (13.19) | 21.29 (12.81) | 12.92 |
GF | 6.19 | 13.10 | 11.37 | 6.71 | 7.87 | 8.24 | 22.56 (11.87) | 23.95 (12.31) | 22.35 (12.87) | 12.71 |
GFCC | 5.33 | 12.56 | 10.99 | 6.32 | 6.92 | 7.01 | 23.53 (14.34) | 23.95 (14.01) | 22.76 (13.90) | 12.50 |
LOG-MEL | 5.14 | 12.07 | 10.28 | 6.00 | 6.98 | 7.52 | 21.18 (13.88) | 22.75 (13.54) | 21.71 (13.18) | 12.08 |
LOG-MAG | 4.86 | 12.13 | 9.69 | 5.75 | 6.64 | 7.19 | 20.82 (13.84) | 22.57 (13.40) | 21.82 (13.55) | 11.91 |
GFB | 4.99 | 12.47 | 11.51 | 6.22 | 7.01 | 7.86 | 19.61 (13.34) | 20.86 (11.97) | 19.97 (11.60) | 11.75 |
PNCC | 1.74 | 8.88 | 10.76 | 2.18 | 8.68 | 10.52 | 19.97 (10.73) | 19.47 (10.03) | 19.35 (9.56) | 10.78 |
MFCC | 4.49 | 11.03 | 9.69 | 5.36 | 5.96 | 6.26 | 19.82 (11.98) | 20.32 (11.47) | 19.66 (11.54) | 10.72 |
RAS-MFCC | 2.61 | 10.47 | 9.56 | 3.08 | 6.74 | 7.37 | 18.12 (11.38) | 19.07 (11.19) | 17.87 (10.30) | 10.44 |
AC-MFCC | 2.89 | 9.63 | 8.89 | 3.31 | 5.61 | 5.91 | 18.66 (12.50) | 18.64 (11.59) | 17.73 (11.27) | 9.87 |
PLP | 3.71 | 10.36 | 9.10 | 4.39 | 5.03 | 5.81 | 16.84 (11.29) | 16.73 (10.92) | 15.46 (9.50) | 9.46 |
SSF-II | 3.41 | 8.57 | 8.68 | 4.18 | 5.45 | 6.00 | 16.76 (10.07) | 17.72 (9.18) | 18.07 (8.93) | 9.09 |
SSF-I | 3.31 | 8.35 | 8.53 | 4.09 | 5.17 | 5.77 | 16.25 (10.44) | 17.70 (9.40) | 18.04 (9.35) | 8.97 |
RASTA-PLP | 1.79 | 7.27 | 8.56 | 1.97 | 6.62 | 7.92 | 11.03 (6.76) | 10.96 (6.06) | 10.27 (6.28) | 7.46 |
PITCH | 2.35 | 4.62 | 4.79 | 3.36 | 3.36 | 4.61 | 19.71 (9.37) | 17.82 (8.45) | 16.87 (6.72) | 7.03 |
GFMC | −0.68 | 7.05 | 5.00 | −0.54 | 4.44 | 4.16 | 5.04 (0.07) | 6.01 (0.33) | 4.97 (0.28) | 4.40 |
WAV | 0.94 | 2.32 | 2.68 | 0.02 | 0.99 | 1.63 | 11.62 (4.81) | 11.92 (6.25) | 10.54 (1.05) | 3.89 |
AMS | 0.31 | 0.30 | −1.38 | 0.19 | −2.99 | −3.40 | 11.73 (5.96) | 10.97 (6.76) | 10.20 (4.90) | 1.71 |
PAC-MFCC | 0.00 | −0.33 | −0.82 | 0.18 | −0.92 | −0.67 | 0.95 (0.15) | 1.25 (0.26) | 1.17 (0.09) | −0.17 |
“Sim.” and “Rec.” Indicate Simulated and Recorded Room Impulse Responses. Boldface Indicates the Best Scores in Each Condition. In Cochannel (Two-Talker) Cases, the Performance is Shown Separately for a Female Interferer and Male Interferer (in Parentheses) with a Male Target Talker
The large performance differences caused by features in both Tables I and II demonstrate the importance of features for supervised speech separation. The inclusion of the raw waveform signal in Table II further suggests that, without feature extraction, separation results are poor. But it should be noted that the feedforward DNN used in [34] may not couple well with waveform signals, and CNNs and RNNs may be better suited for so-called end-to-end separation. We will come back to this issue later.
V. MONAURAL SEPARATION ALGORITHMS
In this section, we discuss monaural algorithms for speech enhancement, speech dereverberation as well as dereverberation plus denoising, and speaker separation. We explain representative algorithms and discuss generalization of supervised speech separation.
A. Speech Enhancement
To our knowledge, deep learning was first introduced to speech separation by Wang and Wang in 2012 in two conference papers [179], [180], which were later extended to a journal version in 2013 [181]. They used DNN for subband classification to estimate the IBM. In the conference versions, feedforward DNNs with RBM pretraining were used as binary classifiers, as well as feature encoders for structured perceptrons [179] and conditional random fields [180]. They reported strong separation results in all cases of DNN usage, with better results for DNN used for feature learning due to the incorporation of temporal dynamics in structured prediction.
In the journal version [181], the input signal is passed through a 64-channel gammatone filterbank to derive subband signals, from which acoustic features are extracted within each T-F unit. These features form the input to subband DNNs (64 in total) to learn more discriminative features. This use of DNN for speech separation is illustrated in Fig. 4. After DNN training, input features and learned features of the last hidden layer are concatenated and fed to linear SVMs to estimate the subband IBM efficiently. This algorithm was further extended to a two-stage DNN [65], where the first stage is trained to estimate the subband IBM as usual and the second stage explicitly incorporates the T-F context in the following way. After the first-stage DNN is trained, a unit-level output before binarization can be interpreted as the posterior probability that speech dominates the T-F unit. Hence the first-stage DNN output is considered a posterior mask. In the second stage, a T-F unit takes as input a local window of the posterior mask centered at the unit. The two-stage DNN is illustrated in Fig. 5. This second-stage structure is reminiscent of a convolutional layer in CNN but without weight sharing. This way of leveraging contextual information is shown to significantly improve classification accuracy. Subject tests demonstrate that this DNN produced large intelligibility improvements for both HI and NH listeners, with HI listeners benefiting more [65]. This is the first monaural algorithm to provide substantial speech intelligibility improvements for HI listeners in background noise, so much so that HI subjects with separation outperformed NH subjects without separation.
In 2013, Lu et al. [116] published an Interspeech paper that uses a deep autoencoder (DAE) for speech enhancement. A basic autoencoder (AE) is an unsupervised learning machine, typically having a symmetric architecture with one hidden layer and tied weights, that learns to map an input signal to itself. Multiple trained AEs can be stacked into a DAE that is then subject to traditional supervised fine-tuning, e.g., with the backpropagation algorithm. In other words, autoencoding is an alternative to RBM pretraining. The algorithm in [116] learns to map from the mel-frequency power spectrum of noisy speech to that of clean speech, so it can be regarded as the first mapping-based method.
Subsequently, but independently of [116], Xu et al. [196] published a study using a DNN with RBM pretraining to map from the log power spectrum of noisy speech to that of clean speech, as shown in Fig. 6. Unlike [116], the DNN used in [196] is a standard feedforward MLP. After training, the DNN estimates the clean speech spectrum from a noisy input. Their experimental results show that the trained DNN yields about 0.4 to 0.5 PESQ gains over noisy speech on untrained noises, which are higher than those obtained by a representative traditional enhancement method.
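The spectral-mapping setup can be sketched as follows in PyTorch; the three hidden layers of 2048 ReLUs and the 11-frame input context are illustrative choices rather than the exact configuration of [196], and random tensors stand in for real log-power features.

```python
import torch
import torch.nn as nn

# A minimal spectral-mapping network: a window of noisy log-power frames in,
# one clean log-power frame out (sizes are illustrative assumptions).
context, n_freq = 11, 257
model = nn.Sequential(
    nn.Linear(context * n_freq, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, n_freq),              # linear output for a regression target
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

# One toy gradient step on random tensors standing in for real features.
noisy_window = torch.randn(128, context * n_freq)   # batch of 128 frames
clean_frame = torch.randn(128, n_freq)
loss = mse(model(noisy_window), clean_frame)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```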
Many subsequent studies have since been published along the lines of T-F masking and spectral mapping. In [185], [186], RNNs with LSTM were used for speech enhancement and its application to robust ASR, where training aims for signal approximation (see Section III.I). RNNs were also used in [41] to estimate the PSM. In [132], [210], a deep stacking network was proposed for IBM estimation, and a mask estimate was then used for pitch estimation. The accuracy of both mask estimation and pitch estimation improves after the two modules iterate for several cycles. A DNN was used to simultaneously estimate the real and imaginary components of the cIRM, yielding better speech quality over IRM estimation [188]. Speech enhancement at the phoneme level has been recently studied [18], [183]. In [59], the DNN takes into account perceptual masking with a piecewise gain function. In [198], multi-objective learning is shown to improve enhancement performance. It has been demonstrated that a hierarchical DNN performing subband spectral mapping yields better enhancement than a single DNN performing full-band mapping [39]. In [161], skip connections between nonconsecutive layers are added to the DNN to improve enhancement performance. Multi-target training with both masking and mapping based targets is found to outperform single-target training [205]. CNNs have also been used for IRM estimation [83] and spectral mapping [46], [136], [138].
Aside from masking and mapping based approaches, there is recent interest in using deep learning to perform end-to-end separation, i.e., temporal mapping without resorting to a T-F representation. A potential advantage of this approach is to circumvent the need to use the phase of noisy speech in reconstructing enhanced speech, which can degrade speech quality, particularly when the input SNR is low. Recently, Fu et al. [47] developed a fully convolutional network (a CNN with fully connected layers removed) for speech enhancement. They observe that full connections make it difficult to map both high and low frequency components of a waveform signal, and with their removal, enhancement results improve. As a convolution operator is the same as a filter or a feature extractor, CNNs appear to be a natural choice for temporal mapping.
A recent study employs a GAN to perform temporal mapping [138]. In the so-called speech enhancement GAN (SEGAN), the generator is a fully convolutional network, performing enhancement or denoising. The discriminator follows the same convolutional structure as G, and it transmits information of generated waveform signals versus clean signals back to G. D can be viewed as providing a trainable loss function for G. SEGAN was evaluated on untrained noisy conditions, but the results are inconclusive and worse than masking or mapping methods. In another GAN study [122], G tries to enhance the spectrogram of noisy speech while D tries to distinguish between the enhanced spectrograms and those of clean speech. The comparisons in [122] show that the enhancement results by this GAN are comparable to those achieved by a DNN.
Not all deep learning based speech enhancement methods build on DNNs. For example, Le Roux et al. [105] proposed deep NMF that unfolds NMF operations and includes multiplicative updates in backpropagation. Vu et al. [167] presented an NMF framework in which a DNN is trained to map NMF activation coefficients of noisy speech to their clean version.
B. Generalization of Speech Enhancement Algorithms
For any supervised learning task, generalization to untrained conditions is a crucial issue. In the case of speech enhancement, data-driven algorithms bear the burden of proof when it comes to generalization, because the issue does not arise in traditional speech enhancement and CASA algorithms which make minimal use of supervised training. Supervised enhancement has three aspects of generalization: noise, speaker, and SNR. Regarding SNR generalization, one can simply include more SNR levels in a training set and practical experience shows that supervised enhancement is not sensitive to precise SNRs used in training. Part of the reason is that, even though a few mixture SNRs are included in training, local SNRs at the frame level and T-F unit level usually vary over a wide range, providing a necessary variety for a learning machine to generalize well. An alternative strategy is to adopt progressive training with increasing numbers of hidden layers to handle lower SNR conditions [48].
In an effort to address the mismatch between training and test conditions, Kim and Smaragdis [98] proposed a two-stage DNN where the first stage is a standard DNN to perform spectral mapping and the second stage is an autoencoder that performs unsupervised adaptation during the test stage. The AE is trained to map the magnitude spectrum of a clean utterance to itself, much like [115], and hence its training does not need labeled data. The AE is then stacked on top of the DNN, and serves as a purity checker as shown in Fig. 7. The rationale is that well enhanced speech tends to produce a small difference (error) between the input and the output of the AE, whereas poorly enhanced speech should produce a large error. Given a test mixture, the already-trained DNN is fine-tuned with the error signal coming from the AE. The introduction of an AE module provides a way of unsupervised adaptation to test conditions that are quite different from the training conditions, and is shown to improve the performance of speech enhancement.
Noise generalization is fundamentally challenging as all kinds of stationary and nonstationary noises may interfere with a speech signal. When available training noises are limited, one technique is to expand training noises through noise perturbation, particularly frequency perturbation [23]; specifically, the spectrogram of an original noise sample is perturbed to generate new noise samples. To make the DNN-based mapping algorithm of Xu et al. [196] more robust to new noises, Xu et al. [195] incorporate noise aware training, i.e., the input feature vector includes an explicit noise estimate. With noise estimated via binary masking, the DNN with noise aware training generalizes better to untrained noises.
Noise generalization is systematically addressed in [24]. The DNN in this study was trained to estimate the IRM at the frame level. In addition, the IRM is simultaneously estimated over several consecutive frames, and different estimates for the same frame are averaged to produce a smoother, more accurate mask (see also [178]). The DNN has five hidden layers with 2048 ReLUs in each. The input features for each frame are cochleagram response energies (the GF feature in Tables I and II). The training set includes 640,000 mixtures created from 560 IEEE sentences and 10,000 (10K) noises from a sound effect library (www.sound-ideas.com) at the fixed SNR of −2 dB. The total duration of the noises is about 125 hours, and the total duration of training mixtures is about 380 hours. To evaluate the impact of the number of training noises on noise generalization, the same DNN is also trained with 100 noises as done in [181]. The test sets are created using 160 IEEE sentences and nonstationary noises at various SNRs. Neither test sentences nor test noises are used during training. The separation results measured in STOI are shown in Table III, and large STOI improvements are obtained by the 10K-noise model. In addition, the 10K-noise model substantially outperforms the 100-noise model, and its average performance matches the noise-dependent models trained with the first half of the training noises and tested with the second half. Subject tests show that the noise-independent model resulting from large-scale training significantly improves speech intelligibility for NH and HI listeners in unseen noises. This study strongly suggests that large-scale training with a wide variety of noises is a promising way to address noise generalization.
TABLE III. STOI Results for Models Trained With Different Numbers of Noises
Model | Babble1 | Cafeteria | Factory | Babble2 | Average |
---|---|---|---|---|---|
Unprocessed | 0.612 | 0.596 | 0.611 | 0.611 | 0.608 |
100-noise model | 0.683 | 0.704 | 0.750 | 0.688 | 0.706 |
10K-noise model | 0.792 | 0.783 | 0.807 | 0.786 | 0.792 |
Noise-dependent model | 0.833 | 0.770 | 0.802 | 0.762 | 0.792 |
As for speaker generalization, a separation system trained on a specific speaker would not work well for a different speaker. A straightforward attempt at speaker generalization would be to train with a large number of speakers. However, experimental results [20], [100] show that a feedforward DNN appears incapable of modeling a large number of talkers. Such a DNN typically takes a window of acoustic features for mask estimation, without using the long-term context. Unable to track a target speaker, a feedforward network has a tendency to mistake noise fragments for target speech. RNNs naturally model temporal dependencies, and are thus expected to be more suitable for speaker generalization than feedforward DNNs.
We have recently employed an RNN with LSTM to address speaker generalization of noise-independent models [20]. The model, shown in Fig. 8, is trained on 3,200,000 mixtures created from 10,000 noises mixed with 6, 10, 20, 40, and 77 speakers. When tested with trained speakers, as shown in Fig. 9(a), the performance of the DNN degrades as more training speakers are added to the training set, whereas the LSTM benefits from additional training speakers. For untrained test speakers, as shown in Fig. 9(b), the LSTM substantially outperforms the DNN in terms of STOI. The LSTM appears able to track a target speaker over time after being exposed to many speakers during training. With large-scale training on many speakers and numerous noises, RNNs with LSTM represent an effective approach to speaker- and noise-independent speech enhancement.
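A minimal PyTorch sketch of an LSTM-based mask estimator in the spirit of Fig. 8 is given below; the feature dimension, number of layers, and hidden size are illustrative assumptions rather than the configuration used in [20].

```python
import torch
import torch.nn as nn

class LSTMMaskEstimator(nn.Module):
    """A minimal LSTM ratio-mask estimator (illustrative layer sizes)."""
    def __init__(self, feat_dim=161, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)
    def forward(self, noisy_feats):
        """noisy_feats: (batch, frames, feat_dim) -> mask values in [0, 1]."""
        h, _ = self.lstm(noisy_feats)
        return torch.sigmoid(self.out(h))

# training step: the LSTM sees whole utterances, which lets it exploit long-term
# context to keep track of the target speaker across frames
model = LSTMMaskEstimator()
noisy, irm = torch.rand(4, 300, 161), torch.rand(4, 300, 161)
loss = nn.functional.mse_loss(model(noisy), irm)
loss.backward()
```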
C. Speech Dereverberation and Denoising
In a real environment, speech is usually corrupted by reverberation from surface reflections. Room reverberation corresponds to a convolution of the direct signal with an RIR, and it distorts speech signals along both time and frequency. Reverberation is a well-recognized challenge in speech processing, particularly when it is combined with background noise. As a result, dereverberation has been actively investigated for a long time [5], [61], [131], [191].
Han et al. [57] proposed the first DNN based approach to speech dereverberation. This approach uses spectral mapping on a cochleagram. In other words, a DNN is trained to map from a window of reverberant speech frames to a frame of anechoic speech, as illustrated in Fig. 10. The trained DNN can reconstruct the cochleagram of anechoic speech with surprisingly high quality. In their later work [58], they apply spectral mapping on a spectrogram and extend the approach to perform both dereverberation and denoising.
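The following PyTorch sketch illustrates the general spectral-mapping setup of Fig. 10: a feedforward network maps a symmetric window of reverberant feature frames to the corresponding anechoic frame. The feature dimension, context size, and layer sizes are placeholder assumptions, not those of [57].

```python
import torch
import torch.nn as nn

class SpectralMappingDNN(nn.Module):
    """Map a window of reverberant feature frames to one anechoic frame."""
    def __init__(self, feat_dim=64, context=5, hidden=1024):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)   # symmetric context window
        self.context = context
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))
    def forward(self, reverb_feats):
        """reverb_feats: (frames, feat_dim) reverberant cochleagram or spectrogram."""
        frames, _ = reverb_feats.shape
        # replicate edge frames so every frame has a full context window
        padded = torch.cat([reverb_feats[:1].repeat(self.context, 1),
                            reverb_feats,
                            reverb_feats[-1:].repeat(self.context, 1)], dim=0)
        windows = torch.stack([padded[t:t + 2 * self.context + 1].reshape(-1)
                               for t in range(frames)])
        return self.net(windows)                # one anechoic frame per input frame

# training step: minimize MSE between predicted and anechoic (dry) frames
model = SpectralMappingDNN()
reverb, anechoic = torch.rand(200, 64), torch.rand(200, 64)
loss = nn.functional.mse_loss(model(reverb), anechoic)
loss.backward()
```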
A more sophisticated system was proposed recently by Wu et al. [190], who observe that dereverberation performance improves when frame length and shift are chosen differently depending on the reverberation time (T60). Based on this observation, their system includes T60 as a control parameter in feature extraction and DNN training. During the dereverberation stage, T60 is estimated and used to choose appropriate frame length and shift for feature extraction. This so-called reverberation-time-aware model is illustrated in Fig. 11. Their comparisons show an improvement in dereverberation performance over the DNN in [58].
To improve the estimation of anechoic speech from reverberant and noisy speech, Xiao et al. [194] proposed a DNN trained to predict static, delta, and acceleration features at the same time. The static features are the log magnitudes of clean speech, and the delta and acceleration features are derived from the static features. It is argued that a DNN that predicts static features well should also predict delta and acceleration features well. The incorporation of dynamic features in the DNN structure helps to improve the estimation of static features for dereverberation.
Zhao et al. [211] observe that spectral mapping is more effective for dereverberation than T-F masking, whereas masking works better than mapping for denoising. Consequently, they construct a two-stage DNN where the first stage performs ratio masking for denoising and the second stage performs spectral mapping for dereverberation. Furthermore, to alleviate the adverse effects of using the phase of reverberant-noisy speech in resynthesizing the waveform of enhanced speech, this study extends the time-domain signal reconstruction technique of [182]. Here the training target is defined in the time domain, but clean phase is used during training, unlike in [182] where noisy phase is used. The two stages are first trained individually and then trained jointly. The results in [211] show that the two-stage DNN model significantly outperforms the single-stage models for either mapping or masking.
D. Speaker Separation
The goal of speaker separation is to extract multiple speech signals, one for each speaker, from a mixture containing two or more voices. After deep learning was demonstrated to be capable of speech enhancement, DNN has been successfully applied to speaker separation under a similar framework, which is illustrated in Fig. 12 in the case of two-speaker or cochannel separation.
According to our literature search, Huang et al. [81] were the first to introduce DNN for this task. This study addresses two-speaker separation using both a feedforward DNN and an RNN. The authors argue that the summation of the spectra of the two estimated sources at frame t, $\hat{S}_1(t)$ and $\hat{S}_2(t)$, is not guaranteed to equal the spectrum of the mixture. Therefore, a masking layer is added to the network, which produces two final outputs shown in the following equations:
$$\tilde{S}_1(t) = \frac{|\hat{S}_1(t)|}{|\hat{S}_1(t)| + |\hat{S}_2(t)|} \odot Y(t) \tag{8}$$

$$\tilde{S}_2(t) = \frac{|\hat{S}_2(t)|}{|\hat{S}_1(t)| + |\hat{S}_2(t)|} \odot Y(t) \tag{9}$$
where Y(t) denotes the mixture spectrum at frame t and $\odot$ denotes element-wise multiplication. This amounts to the signal approximation training target introduced in Section III.I. Both binary and ratio masking are found to be effective. In addition, discriminative training is applied to maximize the difference between one speaker and the estimated version of the other. During training, the following cost is minimized:
$$\|\tilde{S}_1(t) - S_1(t)\|^2 + \|\tilde{S}_2(t) - S_2(t)\|^2 - \gamma\|\tilde{S}_1(t) - S_2(t)\|^2 - \gamma\|\tilde{S}_2(t) - S_1(t)\|^2 \tag{10}$$
where S1(t) and S2(t) denote the ground truth spectra for Speaker 1 and Speaker 2, respectively, and γ is a tunable parameter. Experimental results have shown that both the masking layer and discriminative training improve speaker separation [82].
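The sketch below expresses Eqs. (8)-(10) directly in PyTorch for the ratio-masking case; tensor shapes, the epsilon for numerical stability, and the value of gamma are illustrative assumptions.

```python
import torch

def masking_layer(s1_hat, s2_hat, mixture, eps=1e-8):
    """Eqs. (8)-(9): renormalize the two source estimates so that they sum to
    the mixture spectrum (soft/ratio version of the masking layer)."""
    denom = s1_hat.abs() + s2_hat.abs() + eps
    s1_tilde = s1_hat.abs() / denom * mixture
    s2_tilde = s2_hat.abs() / denom * mixture
    return s1_tilde, s2_tilde

def discriminative_cost(s1_tilde, s2_tilde, s1_ref, s2_ref, gamma=0.05):
    """Eq. (10): also penalize similarity between each estimate and the *other*
    speaker's reference; the value of gamma here is illustrative."""
    mse = lambda a, b: torch.mean((a - b) ** 2)
    return (mse(s1_tilde, s1_ref) + mse(s2_tilde, s2_ref)
            - gamma * mse(s1_tilde, s2_ref) - gamma * mse(s2_tilde, s1_ref))

# toy usage on random magnitude spectra (frames x frequency bins)
Y = torch.rand(100, 257)                                      # cochannel mixture
S1, S2 = torch.rand(100, 257), torch.rand(100, 257)           # reference spectra
S1_hat, S2_hat = torch.rand(100, 257), torch.rand(100, 257)   # network outputs
S1_t, S2_t = masking_layer(S1_hat, S2_hat, Y)
loss = discriminative_cost(S1_t, S2_t, S1, S2)
```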
A few months later, Du et al. [38] appeared to have independently proposed a DNN for speaker separation similar to [81]. In this study [38], the DNN is trained to estimate the log power spectrum of the target speaker from that of a cochannel mixture. In a different paper [162], they trained a DNN to map a cochannel signal to the spectrum of the target speaker as well as the spectrum of an interfering speaker, as illustrated in Fig. 12 (see [37] for an extended version). A notable extension compared to [81] is that these papers also address the situation where only the target speaker is the same between training and testing, while interfering speakers are different between training and testing.
In speaker separation, if the underlying speakers are not allowed to change from training to testing, we have the speaker-dependent situation. If interfering speakers are allowed to change but the target speaker is fixed, separation is called target dependent. In the least constrained case, where none of the speakers are required to be the same between training and testing, separation is called speaker independent. From this perspective, Huang et al.'s approach is speaker dependent [81], [82], and the studies in [38], [162] deal with both speaker-dependent and target-dependent separation. Their way of relaxing the constraint on interfering speakers is simply to train with cochannel mixtures of the target speaker and many interferers.
Zhang and Wang proposed a deep ensemble network to address speaker-dependent as well as target-dependent separation [206]. They employ multi-context networks to integrate temporal information at different resolutions. An ensemble is constructed by stacking multiple modules, each performing multi-context masking or mapping. Several training targets were examined in this study. For speaker-dependent separation, signal approximation is shown to be most effective; for target-dependent separation, a combination of ratio masking and signal approximation is most effective. Furthermore, the performance of target-dependent separation is close to that of speaker-dependent separation. Recently, Wang et al. [174] took a step further toward relaxing speaker dependency in talker separation. Their approach assigns each speaker to one of four clusters (two for male and two for female speakers), and then trains a DNN-based gender mixture detector to determine the clusters of the two underlying speakers in a mixture. Although trained on a subset of speakers in each cluster, their evaluation results show that the speaker separation approach works well for the other, untrained speakers in each cluster; in other words, this approach exhibits a degree of speaker independence.
Healy et al. [63] have recently used a DNN for speaker-dependent cochannel separation and performed speech intelligibility evaluation of the DNN with both HI and NH listeners. The DNN was trained to estimate the IRM and its complement, corresponding to the target talker and the interfering talker. Compared to earlier DNN-based cochannel separation studies, the algorithm in [63] uses a diverse set of features and predicts multiple IRM frames, resulting in better separation. The intelligibility results are shown in Fig. 13. For the HI group, intelligibility improvement from DNN-based separation is 42.5, 49.2, and 58.7 percentage points at −3, −6, and −9 dB target-to-interferer ratio (TIR), respectively. For the NH group, there are statistically significant improvements, but to a smaller extent. It is remarkable that the large intelligibility improvements obtained by HI listeners allow them to perform equivalently to NH listeners (without algorithm help) at the common TIRs of −6 and −9 dB.
Speaker-independent separation can be treated as unsupervised clustering, where T-F units are clustered into distinct classes dominated by individual speakers [6], [79]. Clustering is a flexible framework in terms of the number of speakers to separate, but it does not fully benefit from the discriminative information utilized in supervised training. Hershey et al. [69] were the first to address speaker-independent multitalker separation in the DNN framework. Their approach, called deep clustering, combines DNN-based feature learning and spectral clustering. With a ground-truth partition of T-F units, the affinity matrix A can be computed as:
$$A = YY^T \tag{11}$$
where Y is the indicator matrix built from the IBM: $Y_{i,c}$ is set to 1 if T-F unit i belongs to (i.e., is dominated by) speaker c, and to 0 otherwise. The DNN is trained to embed each T-F unit, and the estimated affinity matrix $\hat{A}$ can be derived from the embeddings. The DNN learns to output similar embeddings for T-F units originating from the same speaker by minimizing the following cost function:
$$C_Y(V) = \|VV^T - YY^T\|_F^2 \tag{12}$$
where V is the embedding matrix for T-F units; each row of V is the embedding of one T-F unit, and $\|\cdot\|_F^2$ denotes the squared Frobenius norm. A low-rank formulation can be applied to efficiently calculate the cost function and its derivatives. During inference, a mixture is segmented and the embedding matrix V is computed for each segment. Then the embedding matrices of all segments are concatenated. Finally, the K-means algorithm is applied to cluster the T-F units of all the segments into speaker clusters. Segment-level clustering is more accurate than utterance-level clustering, but with clustering results only for individual segments, the problem of sequential organization has to be addressed. Deep clustering is shown to produce high-quality speaker separation, significantly better than a CASA method [79] and an NMF method for speaker-independent separation.
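A compact way to compute the deep clustering objective is the low-rank expansion $\|VV^T - YY^T\|_F^2 = \|V^TV\|_F^2 - 2\|V^TY\|_F^2 + \|Y^TY\|_F^2$, which avoids forming the units-by-units affinity matrices explicitly. The sketch below implements this expansion (our illustration of the published loss; the unit count and embedding dimension are arbitrary).

```python
import torch
import torch.nn.functional as F

def deep_clustering_loss(V, Y):
    """Eq. (12) in its low-rank form, avoiding (units x units) matrices.
    V: (units, embed_dim) embeddings, Y: (units, speakers) one-hot IBM labels."""
    VtV = V.t() @ V
    VtY = V.t() @ Y
    YtY = Y.t() @ Y
    return (VtV ** 2).sum() - 2 * (VtY ** 2).sum() + (YtY ** 2).sum()

# toy usage: 2000 T-F units, 40-dimensional unit-norm embeddings, 2 speakers
units, embed_dim, speakers = 2000, 40, 2
V = F.normalize(torch.randn(units, embed_dim), dim=1)
Y = F.one_hot(torch.randint(0, speakers, (units,)), speakers).float()
loss = deep_clustering_loss(V, Y)
```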
A recent extension of deep clustering is the deep attractor network [25], which also learns high-dimensional embeddings for T-F units. Unlike deep clustering, this deep network creates attractor points akin to cluster centers in order to pull T-F units dominated by different speakers to their corresponding attractors. Speaker separation is then performed as mask estimation by comparing embedded points and each attractor. The results in [25] show that the deep attractor network yields better results than deep clustering.
While clustering-based methods naturally lead to speaker-independent models, DNN based masking/mapping methods tie each output of the DNN to a specific speaker, and lead to speaker-dependent models. For example, mapping based methods minimize the following cost function:
$$J = \sum_{k}\sum_{t} \big\| |\hat{S}_k(t)| - |S_k(t)| \big\|^2 \tag{13}$$
where $|\hat{S}_k(t)|$ and $|S_k(t)|$ denote the estimated and actual spectral magnitudes of speaker k, respectively, and t denotes the time frame. To untie DNN outputs from speakers and train a speaker-independent model using a masking or mapping technique, Yu et al. [202] recently proposed permutation-invariant training, which is shown in Fig. 14. For two-speaker separation, a DNN is trained to output two masks, each of which is applied to noisy speech to produce a source estimate. During DNN training, the cost function is dynamically calculated. If we assign each output to a reference speaker $|S_k(t)|$ in the training data, there are two possible assignments, each of which is associated with an MSE. The assignment with the lower MSE is chosen and the DNN is trained to minimize the corresponding MSE. During both training and inference, the DNN takes a segment or multiple frames of features, and estimates two sources for the segment. Since the two outputs of the DNN are not tied to any speaker, the same speaker may switch from one output to another across consecutive segments. Therefore, the estimated segment-level sources need to be sequentially organized unless segments are as long as utterances. Although much simpler, this approach is shown to match the speaker separation results obtained with deep clustering [101], [202].
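For the two-speaker case, permutation-invariant training reduces to evaluating the separation cost under both possible output-to-speaker assignments and minimizing the smaller one, as in the following sketch (a simplified illustration; segment handling and the exact cost in [101], [202] differ in detail).

```python
import torch

def pit_loss_two_speakers(est1, est2, ref1, ref2):
    """Permutation-invariant training for two sources: compute the MSE under
    both output-to-speaker assignments and keep the smaller one."""
    mse = lambda a, b: torch.mean((a - b) ** 2)
    cost_a = mse(est1, ref1) + mse(est2, ref2)   # assignment (1->1, 2->2)
    cost_b = mse(est1, ref2) + mse(est2, ref1)   # assignment (1->2, 2->1)
    return torch.minimum(cost_a, cost_b)

# toy usage on segment-level magnitude spectra (frames x frequency bins)
est1, est2 = torch.rand(50, 257), torch.rand(50, 257)
ref1, ref2 = torch.rand(50, 257), torch.rand(50, 257)
loss = pit_loss_two_speakers(est1, est2, ref1, ref2)
```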
It should be noted that, although speaker separation evaluations typically focus on two-speaker mixtures, the separation framework can be generalized to more than two talkers. For example, the diagrams in both Figs. 12 and 14 can be straightforwardly extended to handle, say, three-talker mixtures. One can also train target-independent models using multi-speaker mixtures. For speaker-independent separation, deep clustering [69] and permutation-invariant training [101] are both formulated for multi-talker mixtures and evaluated on such data. Scaling deep clustering from two-speaker mixtures to more than two speakers is more straightforward than scaling permutation-invariant training.
An insight from the body of work overviewed in this speaker separation subsection is that a DNN model trained with many pairs of different speakers is able to separate a pair of speakers never included in training, a case of speaker independent separation, but only at the frame level. For speaker-independent separation, the key issue is how to group well-separated speech signals at individual frames (or segments) across time. This is precisely the issue of sequential organization, which is much investigated in CASA [172]. Permutation-invariant training may be considered as imposing sequential grouping constraints during DNN training. On the other hand, typical CASA methods utilize pitch contours, vocal tract characteristics, rhythm or prosody, and even common spatial direction when multiple sensors are available, which do not usually involve supervised learning. It seems to us that integrating traditional CASA techniques and deep learning is a fertile ground for future research.
VI. ARRAY SEPARATION ALGORITHMS
An array of microphones provides multiple monaural recordings, which contain information indicative of the spatial origin of a sound source. When sound sources are spatially separated, with sensor array inputs one may localize sound sources and then extract the source from the target location or direction. Traditional approaches to source separation based on spatial information include beamforming, as mentioned in Section I, and independent component analysis [3], [8], [85]. Sound localization and location-based grouping are among the classic topics in auditory perception and CASA [12], [15], [172].
A. Separation Based on Spatial Feature Extraction
The first study in supervised speech segregation was conducted by Roman et al. [141] in the binaural domain. This study performs supervised classification to estimate the IBM based on two binaural features: ITD and ILD, both extracted from individual T-F unit pairs of the left-ear and right-ear cochleagrams. Note that, in this case, the IBM is defined on the noisy speech at a single ear (reference channel). Classification is based on maximum a posteriori (MAP) probability, where the likelihood is given by a density estimation technique. Another classic two-sensor separation technique, DUET (Degenerate Unmixing Estimation Technique), was published by Yilmaz and Rickard [199] at about the same time. DUET is based on unsupervised clustering, and the spatial features used are phase and amplitude differences between the two microphones. The contrast between classification and clustering in these studies is a persistent theme and anticipates similar contrasts in later studies, e.g., binary masking [71] vs. clustering [72] for beamforming (see Section VI.B), and deep clustering [69] versus mask estimation [101] for talker-independent speaker separation (see Section V.D).
The use of spatial information afforded by an array as features in deep learning is a straightforward extension of the earlier use of DNN in monaural separation; one simply substitutes spatial features for monaural features. Indeed, this way of leveraging spatial information provides a natural framework for integrating monaural and spatial features for source separation, a point worth emphasizing as traditional research tends to pursue array separation without considering monaural grouping. It is worth noting that human auditory scene analysis integrates monaural and binaural analysis in a seamless fashion, taking advantage of whatever discriminative information exists in a particular environment [15], [30], [172].
The first study to employ DNN for binaural separation was published by Jiang et al. [90]. In this study, the signals from two ears (or microphones) are passed to two corresponding auditory filterbanks. ITD and ILD features are extracted from T-F unit pairs and sent to a subband DNN for IBM estimation, one DNN for each frequency channel. In addition, a monaural feature (GFCC, see Table I) is extracted from the left-ear input. A number of conclusions can be drawn from this study. Perhaps most important is the observation that the trained DNN generalizes well to untrained spatial configurations of sound sources. A spatial configuration refers to a specific placement of sound sources and sensors in an acoustic environment. This is key to the use of supervised learning, as there are infinitely many configurations and a training set cannot enumerate them. DNN based binaural separation is also found to generalize well to untrained RIRs and reverberation times. It is further observed that the incorporation of the monaural feature improves separation performance, especially when the target and interfering sources are co-located or close to each other.
Araki et al. [2] subsequently employed a DNN for spectral mapping that includes the spatial features of ILD, interaural phase difference (IPD), and enhanced features derived with an initial mask computed from location information, in addition to monaural input. Their evaluation with ASR-related metrics shows that the best enhancement performance is obtained with a combination of monaural and enhanced features. Fan et al. [43] proposed a spectral mapping approach utilizing both binaural and monaural inputs. For the binaural features, this study uses subband ILDs, which are found to be more effective than fullband ILDs. These features are then concatenated with the left ear's frame-level log power spectra to form the input to the DNN, which is trained to map to the spectrum of clean speech. A quantitative comparison with [90] shows that their system produces better PESQ scores for separated speech but similar STOI numbers.
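As a simple illustration of spatial feature extraction, the sketch below computes per-T-F-unit ILD and IPD from the complex STFTs of the two ears. It is a simplified stand-in: the systems cited above use gammatone filterbanks, cross-correlation-based ITD, and additional features.

```python
import numpy as np

def binaural_features(left_stft, right_stft, eps=1e-8):
    """Compute interaural level and phase differences per T-F unit.
    left_stft, right_stft: complex arrays of shape (freq_bins, frames)."""
    ild = 20.0 * np.log10((np.abs(left_stft) + eps) / (np.abs(right_stft) + eps))
    ipd = np.angle(left_stft * np.conj(right_stft))   # wrapped to (-pi, pi]
    return ild, ipd

# toy usage with random complex spectrograms standing in for the two ears
rng = np.random.default_rng(0)
L = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
R = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
ild, ipd = binaural_features(L, R)
features = np.concatenate([ild, ipd], axis=0)  # stacked as DNN input features
```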
A more sophisticated binaural separation algorithm was proposed by Yu et al. [203]. The spatial features used include IPD, ILD, and a so-called mixing vector that is a form of combined STFT values of a unit pair. The DNN used is a DAE: autoencoders are first trained in an unsupervised fashion and then stacked into a DNN that undergoes supervised fine-tuning. Extracted spatial features are first mapped to high-level features indicating spatial directions via unsupervised DAE training. For separation, a classifier is trained to map the high-level spatial features to a discretized range of source directions. This algorithm operates over subbands, each covering a block of consecutive frequency channels.
Recently, Zhang and Wang [208] developed a DNN for IRM estimation with a more sophisticated set of spatial and spectral features. Their algorithm is illustrated in Fig. 15, where the left-ear and right-ear inputs are fed to two different modules for spectral (monaural) and spatial (binaural) analysis. Instead of monaural analysis on a single ear [43], [90], spectral analysis in [208] is conducted on the output of a fixed beamformer, which itself removes some background interference, by extracting a complementary set of monaural features (see Section IV). For spatial analysis, ITD, in the form of a cross-correlation function, and ILD are extracted. The spectral and spatial features are concatenated to form the input to a DNN for IRM estimation at the frame level. This algorithm is shown to produce substantially better separation results in reverberant multisource environments than conventional beamformers, including MVDR (Minimum Variance Distortionless Response) and MWF (Multichannel Wiener Filter). An interesting observation from their analysis is that much of the benefit of using a beamformer prior to spectral feature extraction can be obtained simply by concatenating monaural features from the two ears.
Although the above methods are all binaural, involving two sensors, the extension from two sensors to an array with N sensors, N > 2, is usually straightforward. Take the system in Fig. 15, for instance. With N microphones, spectral feature extraction requires no change, as traditional beamformers are already formulated for an arbitrary number of microphones. For spatial feature extraction, the feature space needs to be expanded when more than two sensors are available, either by designating one microphone as a reference for deriving a set of "binaural" features or by considering a matrix of all sensor pairs in a correlation or covariance analysis. The output is a T-F mask or spectral envelope corresponding to the target speech, which may be viewed as monaural. Since traditional beamforming with an array also produces a "monaural" output corresponding to the target source, T-F masking based on spatial features may be considered beamforming or, more accurately, nonlinear beamforming [125], as opposed to traditional beamforming, which is linear.
B. Time-Frequency Masking for Beamforming
Beamforming, as the name suggests, tunes in the signals from a zone of arrival angles centered at a given angle while tuning out the signals outside the zone. To be applicable, a beamformer needs to know the target direction to steer toward. The steering vector is typically supplied by estimating the direction of arrival (DOA) of the target source, or more broadly by sound localization. In reverberant, multisource environments, localizing the target sound is far from trivial. It is well recognized in CASA that localization and separation are two closely related functions ([172], Chapter 5). For human audition, evidence suggests that sound localization largely depends on source separation [30], [60].
Fueled by the CHiME-3 challenge for robust ASR, two independent studies made the first use of DNN based monaural speech enhancement in conjunction with conventional beamforming, both published in ICASSP 2016 [71], [72]. The CHiME-3 challenge provides noisy speech data from a single speaker recorded by 6 microphones mounted on a tablet [7]. In these two studies, monaural speech separation provides the basis for computing the steering vector, cleverly bypassing two tasks that would have been required with DOA estimation: localizing multiple sound sources and selecting the target (speech) source. To explain their idea, let us first describe MVDR as a representative beamformer.
MVDR aims to minimize the noise energy from nontarget directions while imposing linear constraints to maintain the energy from the target direction [45]. The captured signals of an array in the STFT domain can be written as:
$$\mathbf{y}(t,f) = \mathbf{c}(f)\,s(t,f) + \mathbf{n}(t,f) \tag{14}$$
where y(t, f) and n(t, f) denote the STFT spatial vectors of the noisy speech signal and the noise at frame t and frequency f, respectively, and s(t, f) denotes the STFT of the speech source. The term c(f)s(t, f) denotes the speech signal received by the array, and c(f) is the steering vector of the array.
At frequency f, the MVDR beamformer identifies a weight vector w(f) that minimizes the average output power of the beamformer while maintaining the energy along the look (target) direction. Omitting f for brevity, this optimization problem can be formulated as
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \; \mathbf{w}^H \Phi_{n} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^H \mathbf{c} = 1 \tag{15}$$
where H denotes the conjugate transpose and Φn is the spatial covariance matrix of the noise. Note that the minimization of the output power is equivalent to the minimization of the noise power. The solution to this quadratic optimization problem is:
$$\hat{\mathbf{w}} = \frac{\Phi_{n}^{-1}\mathbf{c}}{\mathbf{c}^H \Phi_{n}^{-1}\mathbf{c}} \tag{16}$$
The enhanced speech signal is given by
$$\hat{s}(t,f) = \mathbf{w}(f)^H \mathbf{y}(t,f) \tag{17}$$
Hence, the accurate estimation of c and Φn is key to MVDR beamforming. Furthermore, c corresponds to the principal component (the eigenvector with the largest eigenvalue) of Φx, the spatial covariance matrix of speech. With speech and noise uncorrelated, we have
$$\Phi_{x} = \Phi_{y} - \Phi_{n} \tag{18}$$
where Φy denotes the spatial covariance matrix of the noisy speech. Therefore, a noise estimate is crucial for beamforming performance, just as it is for traditional speech enhancement.
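The chain from covariance estimates to enhanced speech in Eqs. (14)-(18) can be summarized in a few lines of NumPy, as in the sketch below (a generic illustration with random matrices standing in for real statistics; per-frequency looping and regularization details are omitted).

```python
import numpy as np

def mvdr_weights(phi_y, phi_n):
    """Compute MVDR weights at one frequency from the noisy-speech and noise
    spatial covariance matrices: estimate the speech covariance by subtraction
    (Eq. (18)), take its principal eigenvector as the steering vector, then
    apply the closed-form solution of Eq. (16)."""
    phi_x = phi_y - phi_n                     # Eq. (18): speech covariance
    eigvals, eigvecs = np.linalg.eigh(phi_x)  # Hermitian eigendecomposition
    c = eigvecs[:, -1]                        # principal component as steering vector
    numerator = np.linalg.solve(phi_n, c)     # Phi_n^{-1} c
    return numerator / (np.conj(c) @ numerator)   # Eq. (16)

def apply_beamformer(w, y):
    """Eq. (17): enhanced STFT coefficient s_hat(t,f) = w(f)^H y(t,f).
    y: (mics, frames) STFT vectors at one frequency."""
    return np.conj(w) @ y

# toy usage with random covariances for a 6-microphone array at one frequency
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
N = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
phi_n = N @ N.conj().T + 1e-3 * np.eye(6)     # Hermitian positive definite
phi_y = phi_n + A @ A.conj().T                # noisy = speech + noise
w = mvdr_weights(phi_y, phi_n)
s_hat = apply_beamformer(w, rng.normal(size=(6, 100)) + 0j)
```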
In [71], an RNN with bidirectional LSTM is used for IBM estimation. A common neural network is trained monaurally on the data from each of the sensors. The trained network is then used to produce a binary mask for each microphone recording, and the multiple masks are combined into one with a median operation. The single mask is used to estimate the speech and noise covariance matrices, from which beamformer coefficients are obtained. Their results show that MVDR does not work as well as the GEV (generalized eigenvalue) beamformer. In [72], a spatial clustering based approach was proposed to compute a ratio mask. This approach uses a complex-domain GMM (cGMM) to describe the distribution of the T-F units dominated by noise and another cGMM to describe that of the units containing both speech and noise. After parameter estimation, the two cGMMs are used to calculate the covariance matrices of noisy speech and noise, which are fed to an MVDR beamformer for speech separation. Both of these algorithms perform very well, and Higuchi et al.'s method was used in the best-performing system in the CHiME-3 challenge [200]. A similar approach, i.e., DNN-based IRM estimation combined with a beamformer, is also behind the winning system in the most recent CHiME-4 challenge [36].
A method different from the above two studies was given by Nugraha et al. [133], who perform array source separation using DNN for monaural separation and a complex multivariate Gaussian distribution to model spatial information. The DNN in this study is used to model source spectra, or spectral mapping. The power spectral densities (PSDs) and spatial covariance matrices of speech and noise are estimated and updated iteratively. Fig. 16 illustrates the processing pipeline. First, array signals are realigned on the basis of time difference of arrival (TDOA) and averaged to form a monaural signal. A DNN is then used to produce an initial estimate of noise and speech PSDs. During the iterative estimation of PSDs and spatial covariance matrices, DNNs are used to further improve the PSDs estimated by a multichannel Wiener filter. Finally, the estimated speech signals from multiple microphones are averaged to produce a single speech estimate for ASR evaluation. A number of design choices were examined in this study, and their algorithm yields better separation and ASR results than DNN based monaural separation and an array version of NMF-based separation.
The success of Higuchi et al. [72] and Heymann et al. [71] in the CHiME-3 challenge by using DNN-estimated masks for beamforming has motivated many recent studies exploring different ways of integrating T-F masking and beamforming. Erdogan et al. [42] trained an RNN for monaural speech enhancement, from which a ratio mask is computed in order to provide coefficients for an MVDR beamformer. As illustrated in Fig. 17, a ratio mask is first estimated for each microphone. Then multiple masks from an array are combined into one mask by a maximum operator, which is found to produce better results than using multiple masks without combination. It should be noted that their ASR results on the CHiME-3 data are not compelling. Instead of deriving coefficients from a fixed beamformer formulation such as MVDR, beamforming coefficients can also be dynamically predicted by a DNN. Li et al. [108] employed a deep network to predict spatial filters from array inputs of noisy speech for adaptive beamforming. Waveform signals are sent to a shared RNN, whose output is sent to two separate RNNs to predict beamforming filters for two microphones.
Zhang et al. [209] trained a DNN for IRM estimation from a complementary set of monaural features, and then combined multiple ratio masks from an array into a single one with a maximum operator. The ratio mask is used for calculating the noise spatial covariance matrix at time t for an MVDR beamformer as follows,
$$\Phi_{n}(t,f) = \frac{\sum_{l=t-L}^{t+L} \big(1 - RM(l,f)\big)\, \mathbf{y}(l,f)\,\mathbf{y}(l,f)^H}{\sum_{l=t-L}^{t+L} \big(1 - RM(l,f)\big)} \tag{19}$$
where RM(l, f) denotes the estimated IRM from the DNN at frame l and frequency f. An element of the noise covariance matrix is calculated per frame by integrating over a window of 2L + 1 neighboring frames. They find this adaptive way of estimating the noise covariance matrix to perform much better than estimation over the entire utterance or a signal segment. An enhanced speech signal from the beamformer is then fed to the DNN to refine the IRM estimate, and mask estimation and beamforming iterate several times to produce the final output. Their 5.05% WER (word error rate) on the CHiME-3 real evaluation data represents a 13.34% relative improvement over the previous best [200]. Independently, Xiao et al. [193] also proposed to iterate ratio masking and beamforming. They use an RNN to estimate a speech mask and a noise mask. Mask refinement is based on an ASR loss, in order to directly benefit ASR performance. They show that this approach leads to a considerable WER reduction over the use of a conventional MVDR beamformer, although recognition accuracy is not as high as in [200].
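The sketch below illustrates this style of mask-driven, frame-level noise covariance estimation: per-microphone ratio masks are combined with a maximum operator, and outer products of array STFT vectors are weighted by 1 − RM over a window of 2L + 1 frames. The normalization shown is one common choice and may not match [209] exactly.

```python
import numpy as np

def noise_covariance_from_mask(Y, ratio_mask, t, L=2):
    """Estimate the noise spatial covariance at frame t and one frequency
    (a sketch of Eq. (19)).
    Y: (mics, frames) complex STFT vectors at one frequency.
    ratio_mask: (frames,) estimated IRM at that frequency."""
    lo, hi = max(0, t - L), min(Y.shape[1], t + L + 1)
    weights = 1.0 - ratio_mask[lo:hi]                  # noise dominance per frame
    phi_n = sum(w * np.outer(Y[:, l], np.conj(Y[:, l]))
                for l, w in zip(range(lo, hi), weights))
    return phi_n / (weights.sum() + 1e-8)

# toy usage: combine per-microphone masks with a maximum operator, then
# estimate the frame-level noise covariance for a 6-mic array at one frequency
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 50)) + 1j * rng.normal(size=(6, 50))
per_mic_masks = rng.uniform(size=(6, 50))              # one IRM per microphone
combined_mask = per_mic_masks.max(axis=0)              # maximum combination
phi_n_t = noise_covariance_from_mask(Y, combined_mask, t=25, L=2)
```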
Other related studies include Pfeifenberger et al. [139], who use the cosine distance between the principal components of consecutive frames of noisy speech as the feature for DNN mask estimation. Meng et al. [121] use RNNs for adaptive estimation of beamformer coefficients. Their ASR results on the CHiME-3 data are better than the baseline scores, but are far from the best scores. Nakatani et al. [129] integrate DNN mask estimation and cGMM clustering based estimation to further improve the quality of mask estimates. Their results on the CHiME-3 data improve over those obtained from RNN or cGMM generated masks.
VII. DISCUSSION AND CONCLUSION
This paper has provided a comprehensive overview of DNN based supervised speech separation. We have summarized key components of supervised separation, i.e., learning machines, training targets, and acoustic features, explained representative algorithms, and reviewed a large number of related studies. With the formulation of the separation problem as supervised learning, DNN based separation has, in just a few years, greatly elevated the state of the art for a wide range of speech separation tasks, including monaural speech enhancement, speech dereverberation, and speaker separation, as well as array speech separation. This rapid advance will likely continue with a tighter integration of domain knowledge into the data-driven framework and with progress in deep learning itself.
Below we discuss several conceptual issues pertinent to this overview.
A. Features vs. Learning Machines
As discussed in Section IV, features are important for speech separation. However, a main appeal of deep learning is to learn appropriate features for a task, rather than to design such features. So is there a role for feature extraction in the era of deep learning? We believe the answer is yes. The so-called no-free-lunch theorem [189] dictates that no learning algorithm, DNN included, achieves superior performance in all tasks. Aside from theoretical arguments, feature extraction is a way of imparting knowledge from a problem domain, and it stands to reason that incorporating domain knowledge this way is useful (see [176] for a recent example). For instance, the success of CNN in visual pattern recognition is partly due to the use of shared weights and pooling (sampling) layers in its architecture, which help to build representations invariant to small variations in feature positions [10].
It is possible to learn useful features for a problem domain, but doing so may not be computationally efficient, particularly when certain features are known to be discriminative through domain research. Take pitch, for example. Much research in auditory scene analysis shows that pitch is a primary cue for auditory organization [15], [30], and research in CASA demonstrates that pitch alone can go a long way in separating voiced speech [78]. Perhaps a DNN can be trained to “discover” harmonicity as a prominent feature, and there is some hint at this from a recent study [24], but extracting pitch as input features seems like the most straightforward way of incorporating pitch in speech separation.
The above discussion is not meant to discount the importance of learning machines, as this overview has made abundantly clear, but to argue for the relevance of feature extraction despite the power of deep learning. As mentioned in Section V.A, convolutional layers in a CNN amount to feature extraction. Although CNN weights are trained, the use of a particular CNN architecture reflects design choices of its user.
B. Time-Frequency Domain vs. Time Domain
The vast majority of supervised speech separation studies are conducted in the T-F domain as reflected in the various training targets reviewed in Section III. Alternatively, speech separation can be conducted in the time domain without recourse to a frequency representation. As pointed out in Section V.A, through temporal mapping both magnitude and phase can potentially be cleaned at once. End-to-end separation represents an emergent trend along with the use of CNNs and GANs.
A few comments are in order. First, temporal mapping is a welcome addition to the list of supervised separation approaches and provides a unique perspective to phase enhancement [50], [103]. Second, the same signal can be transformed back and forth between its time domain representation and its T-F domain representation. Third, the human auditory system has a frequency dimension at the beginning of the auditory pathway, i.e., at the cochlea. It is interesting to note Licklider’s classic duplex theory of pitch perception, postulating two processes of pitch analysis: a spatial process corresponding to the frequency dimension in the cochlea and a temporal process corresponding to the temporal response of each frequency channel [111]. Computational models for pitch estimation fall into three categories: spectral, temporal, and spectrotemporal approaches [33]. In this sense, a cochleagram, with the individual responses of a cochlear filterbank [118], [172], is a duplex representation.
C. What’s the Target?
When multiple sounds are present in the acoustic environment, which should be treated as the target sound at a particular time? The definition of ideal masks presumes that the target source is known, which is often the case in speech separation applications. For speech enhancement, the speech signal is considered the target while nonspeech signals are considered the interference. The situation becomes tricky for multi-speaker separation. In general, this is the issue of auditory attention and intention. It is a complicated issue, as what is attended to shifts from one moment to the next even with the same input scene, and it does not have to be a speech signal. There are, however, practical solutions. For example, directional hearing aids get around this issue by assuming that the target lies in the look direction, i.e., benefiting from vision [35], [170]. With sources separated, there are other reasonable alternatives for target definition, e.g., the loudest source, the previously attended one (i.e., tracking), or the most familiar one (as in the multi-speaker case). A full account, however, would require a sophisticated model of auditory attention (see [118], [172]).
D. What Does a Solution to the Cocktail Party Problem Look Like?
CASA defines the solution to the cocktail party problem as a system that achieves human separation performance in all listening conditions ([172], p. 28). But how do we actually compare the separation performance of a machine with that of a human listener? Perhaps a straightforward way would be to compare ASR scores and human speech intelligibility scores in various listening conditions. This is a tall order, as ASR performance still falls short in realistic conditions despite tremendous recent advances thanks to deep learning. A drawback of ASR evaluation is its dependency on ASR with all its peculiarities.
Here we suggest a different, concrete measure: a solution to the cocktail party problem is a separation system that elevates the speech intelligibility of hearing-impaired listeners to the level of normal-hearing listeners in all listening situations. Although not as broad as the CASA definition, this definition has the benefit of being tightly linked to a primary driver of speech separation research, namely, eliminating the speech understanding handicap of millions of listeners with impaired hearing [171]. By this definition, the DNN based speech enhancement described above has met the criterion in limited conditions (see Fig. 13 for one example), but clearly not in all conditions. Versatility is the hallmark of human intelligence, and it is the primary challenge facing supervised speech separation research today.
Before closing, we point out that the use of supervised learning and DNN in signal processing goes beyond speech separation, and automatic speech and speaker recognition. The related topics include multipitch tracking [56], [80], voice activity detection [207], and even a task as basic in signal processing as SNR estimation [134]. No matter the task, once it is formulated as a data-driven problem, advances will likely ensue with the use of various deep learning models and suitably constructed training sets; it should also be mentioned that these advances come at the expense of high computational complexity involved in the training process and often in operating a trained DNN model. A considerable benefit of treating signal processing as learning is that signal processing can ride on the progress of machine learning, a rapidly advancing field.
Finally, we remark that human ability to solve the cocktail party problem appears to have much to do with our extensive exposure to various noisy environments (see also [24]). Research indicates that children have poorer ability to recognize speech in noise than adults [54], [92], and musicians are better at perceiving noisy speech than non-musicians [135] presumably due to musicians’ long exposure to polyphonic signals. Relative to monolingual speakers, bilinguals have a deficit when it comes to speech perception in noise, although the two groups are similarly proficient in quiet [159]. All these effects support the notion that extensive training (experience) is part of the reason for the remarkable robustness of the normal auditory system to acoustic interference.
ACKNOWLEDGMENT
The authors would like to thank M. Delfarah for help in manuscript preparation and also J. Du, Y. Tsao, Y. Wang, Y. Xu, and X. Zhang for helpful comments on an earlier version.
This work was supported in part by the AFOSR under Grant FA9550-12-1-0130, in part by the NIDCD under Grant R01 DC012048, and in part by the National Science Foundation under Grant IIS-1409431. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alexey Ozerov.
Biography
DeLiang Wang (M’90–SM’01–F’04) received the B.S. degree in 1983 and the M.S. degree in 1986 from Peking University, Beijing, China, and the Ph.D. degree in 1991 from the University of Southern California, Los Angeles, CA, USA, all in computer science. Since 1991, he has been with the Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, Ohio State University, Columbus, OH, USA, where he is currently a Professor. He has been a Visiting Scholar to Harvard University, Oticon A/S, Denmark, Starkey Hearing Technologies, and Northwestern Polytechnical University, Xi’an, China. His research interests include machine perception and neurodynamics. He was the recipient of the Office of Naval Research Young Investigator Award in 1996, the 2005 Outstanding Paper Award from IEEE TRANSACTIONS ON NEURAL NETWORKS, and the 2008 Helmholtz Award from the International Neural Network Society. In 2014, he was named a University Distinguished Scholar by Ohio State University. He is Co-Editor-in-Chief of Neural Networks.
Jitong Chen received the B.E. degree in information security from Northeastern University, Shenyang, China, in 2011. He received the Ph.D. degree in computer science and engineering from The Ohio State University, Columbus, OH, USA, in 2017. He is currently a Research Scientist with Silicon Valley AI Lab with Baidu Research. He previously interned at MetaMind and Google Research. His research interests include speech separation, robust automatic speech recognition, speech synthesis, machine learning, and signal processing.
Footnotes
More straightforwardly a correlation.
The conclusion is also nuanced for speaker separation [206].
This was first pointed out by Hakan Erdogan in personal communication.
The authors also published a paper in Interspeech 2012 [115] where a DAE is trained in an unsupervised fashion to map from the mel-spectrum of clean speech to itself. The trained DAE is then used to “recall” a clean signal from a noisy input for robust ASR.
Contributor Information
DeLiang Wang, Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA, and also with the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi’an 710072, China.
Jitong Chen, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA. He is now with Silicon Valley AI Lab, Baidu Research, Sunnyvale, CA 94089 USA.
REFERENCES
- [1]. Anzalone MC, Calandruccio L, Doherty KA, and Carney LH, “Determination of the potential benefit of time-frequency gain manipulation,” Ear Hearing, vol. 27, pp. 480–492, 2006.
- [2]. Araki S et al., “Exploring multi-channel features for denoising-autoencoder-based speech enhancement,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 116–120.
- [3]. Araki S, Sawada H, Mukai R, and Makino S, “Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors,” Signal Process., vol. 87, pp. 1833–1847, 2007.
- [4]. Assmann P and Summerfield AQ, “The perception of speech under adverse conditions,” in Speech Processing in the Auditory System, Greenberg S, Ainsworth WA, Popper AN, and Fay RR, Eds. New York, NY, USA: Springer, 2004, pp. 231–308.
- [5]. Avendano C and Hermansky H, “Study on the dereverberation of speech based on temporal envelope filtering,” in Proc. 4th Int. Conf. Spoken Lang. Process., 1996, pp. 889–892.
- [6]. Bach FR and Jordan MI, “Learning spectral clustering, with application to speech separation,” J. Mach. Learn. Res., vol. 7, pp. 1963–2001, 2006.
- [7]. Barker J, Marxer R, Vincent E, and Watanabe S, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in Proc. IEEE Workshop Autom. Speech Recognit. Understanding, 2015, pp. 5210–5214.
- [8]. Bell AJ and Sejnowski TJ, “An information-maximization approach to blind separation and blind deconvolution,” Neural Comput., vol. 7, pp. 1129–1159, 1995.
- [9]. Benesty J, Chen J, and Huang Y, Microphone Array Signal Processing. Berlin, Germany: Springer, 2008.
- [10]. Bengio Y and LeCun Y, “Scaling learning algorithms towards AI,” in Large-Scale Kernel Machines, Bottou L, Chapelle O, DeCoste D, and Weston J, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 321–359.
- [11]. Bey C and McAdams S, “Schema-based processing in auditory scene analysis,” Perception Psychophys., vol. 64, pp. 844–854, 2002.
- [12]. Blauert J, Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge, MA, USA: MIT Press, 1983.
- [13]. Boll SF, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.
- [14]. Brandstein MS and Ward DB, Eds., Microphone Arrays: Signal Processing Techniques and Applications. New York, NY, USA: Springer, 2001.
- [15]. Bregman AS, Auditory Scene Analysis. Cambridge, MA, USA: MIT Press, 1990.
- [16]. Brungart DS, Chang PS, Simpson BD, and Wang DL, “Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation,” J. Acoust. Soc. Amer., vol. 120, pp. 4007–4018, 2006.
- [17]. Carlyon RP, Cusack R, Foxton JM, and Robertson IH, “Effects of attention and unilateral neglect on auditory stream segregation,” J. Exp. Psychol., Human Perception Perform., vol. 27, pp. 115–127, 2001.
- [18]. Chazan SE, Gannot S, and Goldberger J, “A phoneme-based pre-training approach for deep neural network with application to speech enhancement,” in Proc. Int. Workshop Acoust. Echo Noise Control, 2016, pp. 1–5.
- [19]. Chen C and Bilmes JA, “MVA processing of speech features,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 1, pp. 257–270, Jan. 2007.
- [20]. Chen J and Wang DL, “Long short-term memory for speaker generalization in supervised speech separation,” in Proc. INTERSPEECH, 2016, pp. 3314–3318.
- [21]. Chen J and Wang DL, “DNN-based mask estimation for supervised speech separation,” in Audio Source Separation, Makino S, Ed. Berlin, Germany: Springer, 2018, pp. 207–235.
- [22]. Chen J, Wang Y, and Wang DL, “A feature study for classification-based speech separation at low signal-to-noise ratios,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1993–2002, Dec. 2014.
- [23]. Chen J, Wang Y, and Wang DL, “Noise perturbation for supervised speech separation,” Speech Commun., vol. 78, pp. 1–10, 2016.
- [24]. Chen J, Wang Y, Yoho SE, Wang DL, and Healy EW, “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” J. Acoust. Soc. Amer., vol. 139, pp. 2604–2612, 2016.
- [25]. Chen Z, Luo Y, and Mesgarani N, “Deep attractor network for single-microphone speaker separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 246–250.
- [26]. Cherry EC, “Some experiments on the recognition of speech, with one and with two ears,” J. Acoust. Soc. Amer., vol. 25, pp. 975–979, 1953.
- [27]. Cherry EC, On Human Communication. Cambridge, MA, USA: MIT Press, 1957.
- [28]. Ciresan DC, Meier U, and Schmidhuber J, “Multi-column deep neural networks for image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3642–3649.
- [29]. Darwin CJ, “Auditory grouping,” Trends Cogn. Sci., vol. 1, pp. 327–333, 1997.
- [30]. Darwin CJ, “Listening to speech in the presence of other sounds,” Philos. Trans. Roy. Soc. B, vol. 363, pp. 1011–1021, 2008.
- [31]. Darwin CJ and Hukin RW, “Effectiveness of spatial cues, prosody, and talker characteristics in selective attention,” J. Acoust. Soc. Amer., vol. 107, pp. 970–977, 2000.
- [32]. David M, Lavandier M, Grimault N, and Oxenham A, “Sequential stream segregation of voiced and unvoiced speech sounds based on fundamental frequency,” Hearing Res., vol. 344, pp. 235–243, 2017.
- [33]. de Cheveigne A, “Multiple F0 estimation,” in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wang DL and Brown GJ, Eds. Hoboken, NJ, USA: Wiley, 2006, pp. 45–79.
- [34]. Delfarah M and Wang DL, “Features for masking-based monaural speech separation in reverberant conditions,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 5, pp. 1085–1094, May 2017.
- [35]. Dillon H, Hearing Aids, 2nd ed. Turramurra, NSW, Australia: Boomerang, 2012.
- [36]. Du J et al., “The USTC-iFlytek system for the CHiME4 challenge,” in Proc. 4th Int. Workshop Speech Process. Everyday Environ. (CHiME-4), 2016.
- [37]. Du J, Tu Y, Dai L-R, and Lee C-H, “A regression approach to single-channel speech separation via high-resolution deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 8, pp. 1424–1437, Aug. 2016.
- [38]. Du J, Tu Y, Xu Y, Dai L-R, and Lee C-H, “Speech separation of a target speaker based on deep neural networks,” in Proc. 12th Int. Conf. Signal Process., 2014, pp. 65–68.
- [39]. Du J and Xu Y, “Hierarchical deep neural network for multivariate regression,” Pattern Recognit., vol. 63, pp. 149–157, 2017.
- [40]. Ephraim Y and Malah D, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.
- [41]. Erdogan H, Hershey J, Watanabe S, and Le Roux J, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2015, pp. 708–712.
- [42]. Erdogan H, Hershey JR, Watanabe S, Mandel M, and Le Roux J, “Improved MVDR beamforming using single-channel mask prediction networks,” in Proc. INTERSPEECH, 2016, pp. 1981–1985.
- [43]. Fan N, Du J, and Dai L-R, “A regression approach to binaural speech segregation via deep neural network,” in Proc. 10th Int. Symp. Chin. Spoken Lang. Process., 2016, pp. 116–120.
- [44]. Festen JM and Plomp R, “Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing,” J. Acoust. Soc. Amer., vol. 88, pp. 1725–1736, 1990.
- [45]. Frost OL, “An algorithm for linearly constrained adaptive array processing,” Proc. IEEE, vol. 60, no. 8, pp. 926–935, Aug. 1972.
- [46]. Fu S-W, Tsao Y, and Lu X, “SNR-aware convolutional neural network modeling for speech enhancement,” in Proc. INTERSPEECH, 2016, pp. 3678–3772.
- [47]. Fu S-W, Tsao Y, Lu X, and Kawai H, “Raw waveform-based speech enhancement by fully convolutional networks,” arXiv:1703.02205v3, 2017.
- [48]. Gao T, Du J, Dai L-R, and Lee C-H, “SNR-based progressive learning of deep neural network for speech enhancement,” in Proc. INTERSPEECH, 2016, pp. 3713–3717.
- [49]. Gaudrain E, Grimault N, Healy EW, and Béra J-C, “The relationship between concurrent stream segregation, pitch-based streaming of vowel sequences, and frequency selectivity,” Acta Acustica united with Acustica, vol. 98, pp. 317–327, 2012.
- [50]. Gerkmann T, Krawczyk-Becker M, and Le Roux J, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, Mar. 2015.
- [51]. Gonzalez S and Brookes M, “Mask-based enhancement for very low quality speech,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2014, pp. 7029–7033.
- [52]. Goodfellow IJ et al., “Generative adversarial nets,” in Proc. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
- [53]. Graves A, Liwicki M, Fernandez S, Bertolami R, Bunke H, and Schmidhuber J, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855–868, May 2009.
- [54]. Hall JW, Grose JH, Buss E, and Dev MB, “Spondee recognition in a two-talker and a speech-shaped noise masker in adults and children,” Ear Hearing, vol. 23, pp. 159–165, 2002.
- [55]. Han K and Wang DL, “A classification based approach to speech separation,” J. Acoust. Soc. Amer., vol. 132, pp. 3475–3483, 2012.
- [56]. Han K and Wang DL, “Neural network based pitch tracking in very noisy speech,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2158–2168, Dec. 2014.
- [57]. Han K, Wang Y, and Wang DL, “Learning spectral mapping for speech dereverberation,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2014, pp. 4661–4665.
- [58]. Han K, Wang Y, Wang D, Woods WS, Merks I, and Zhang T, “Learning spectral mapping for speech dereverberation and denoising,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 6, pp. 982–992, Jun. 2015.
- [59]. Han W, Zhang X, Sun M, Shi W, Chen X, and Hu Y, “Perceptual improvement of deep neural networks for monaural speech enhancement,” in Proc. Int. Workshop Acoust. Echo Noise Control, 2016.
- [60]. Hartmann WM, “How we localize sounds,” Phys. Today, vol. 52, pp. 24–29, Nov. 1999.
- [61]. Hazrati O, Lee J, and Loizou PC, “Blind binary masking for reverberation suppression in cochlear implants,” J. Acoust. Soc. Amer., vol. 133, pp. 1607–1614, 2013.
- [62]. He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
- [63]. Healy EW, Delfarah M, Vasko JL, Carter BL, and Wang DL, “An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker,” J. Acoust. Soc. Amer., vol. 141, pp. 4230–4239, 2017.
- [64]. Healy EW, Yoho SE, Chen J, Wang Y, and Wang DL, “An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type,” J. Acoust. Soc. Amer., vol. 138, pp. 1660–1669, 2015.
- [65]. Healy EW, Yoho SE, Wang Y, and Wang DL, “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoust. Soc. Amer., vol. 134, pp. 3029–3038, 2013.
- [66]. Hendriks RC, Heusdens R, and Jensen J, “MMSE based noise PSD tracking with low complexity,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2010, pp. 4266–4269.
- [67]. Hermansky H, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Amer., vol. 87, pp. 1738–1752, 1990.
- [68]. Hermansky H and Morgan N, “RASTA processing of speech,” IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
- [69]. Hershey J, Chen Z, Le Roux J, and Watanabe S, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2016, pp. 31–35.
- [70]. Hertz J, Krogh A, and Palmer RG, Introduction to the Theory of Neural Computation. Redwood City, CA, USA: Addison-Wesley, 1991.
- [71]. Heymann J, Drude L, and Haeb-Umbach R, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 196–200.
- [72]. Higuchi T, Ito N, Yoshioka T, and Nakatani T, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2016, pp. 5210–5214.
- [73]. Hinton G et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
- [74]. Hinton GE, Osindero S, and Teh Y-W, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, pp. 1527–1554, 2006.
- [75]. Hochreiter S and Schmidhuber J, “Long short-term memory,” Neural Comput., vol. 9, pp. 1735–1780, 1997.
- [76]. Hu G and Wang DL, “Speech segregation based on pitch tracking and amplitude modulation,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2001, pp. 79–82.
- [77]. Hu G and Wang DL, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.
- [78]. Hu G and Wang DL, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2067–2079, Nov. 2010.
- [79]. Hu K and Wang DL, “An unsupervised approach to cochannel speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 122–131, Jan. 2013.
- [80]. Huang F and Lee T, “Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 99–109, Jan. 2013.
- [81]. Huang P-S, Kim M, Hasegawa-Johnson M, and Smaragdis P, “Deep learning for monaural speech separation,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2014, pp. 1581–1585.
- [82]. Huang P-S, Kim M, Hasegawa-Johnson M, and Smaragdis P, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, pp. 2136–2147, Dec. 2015.
- [83]. Hui L, Cai M, Guo C, He L, Zhang W-Q, and Liu J, “Convolutional maxout neural networks for speech separation,” in Proc. IEEE Int. Symp. Signal Process. Inf. Technol., 2015, pp. 24–27.
- [84]. Hummersone C, Stokes T, and Brookes T, “On the ideal ratio mask as the goal of computational auditory scene analysis,” in Blind Source Separation, Naik GR and Wang W, Eds. Berlin, Germany: Springer, 2014, pp. 349–368.
- [85]. Hyvärinen A and Oja E, “Independent component analysis: Algorithms and applications,” Neural Netw., vol. 13, pp. 411–430, 2000.
- [86]. Ikbal S, Misra H, and Bourlard HA, “Phase autocorrelation (PAC) derived robust speech features,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2003, pp. II-133–II-136.
- [87]. “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” ITU-T Recommendation P.862, 2000.
- [88]. Jarrett DP, Habets E, and Naylor PA, Theory and Applications of Spherical Microphone Array Processing. Zurich, Switzerland: Springer, 2016.
- [89]. Jensen J and Taal CH, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, Nov. 2016.
- [90]. Jiang Y, Wang DL, Liu RS, and Feng ZM, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2112–2121, Dec. 2014.
- [56].Han K and Wang DL, “Neural network based pitch tracking in very noisy speech,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 22, no. 12, pp. 2158–2168, Dec. 2014. [Google Scholar]
- [57].Han K, Wang Y, and Wang DL, “Learning spectral mapping for speech dereverebaration,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2014, pp. 4661–4665. [Google Scholar]
- [58].Han K, Wang Y, Wang D, Woods WS, Merks I, and Zhang T, “Learning spectral mapping for speech dereverberation and denoising,” IEEE/ACM Trans. Audio, Speech, Lang. Process, vol. 23, no. 6, pp. 982–992, Jun. 2015. [Google Scholar]
- [59].Han W, Zhang X, Sun M, Shi W, Chen X, and Hu Y, “Perceptual improvement of deep neural networks for monaural speech enhancement,” in Proc. Int. Workshop Acoust. Echo Noise Control, 2016. [Google Scholar]
- [60].Hartmann WM, “How we localize sounds,” Phys. Today, vol. 52, pp. 24–29, Nov. 1999. [Google Scholar]
- [61].Hazrati O, Lee J, and Loizou PC, “Blind binary masking for reverberation suppression in cochlear implants,” J. Acoust. Soc. Amer, vol. 133, pp. 1607–1614, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [62].He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778. [Google Scholar]
- [63].Healy EW, Delfarah M, Vasko JL, Carter BL, and Wang DL, “An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker,” J. Acoust. Soc. Amer, vol. 141, pp. 4230–4239, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [64].Healy EW, Yoho SE, Chen J, Wang Y, and Wang DL, “An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type,” J. Acoust. Soc. Amer, vol. 138, pp. 1660–1669, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [65].Healy EW, Yoho SE, Wang Y, and Wang DL, “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoust. Soc. Amer, vol. 134, pp. 3029–3038, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [66].Hendriks RC, Heusdens R, and Jensen J, “MMSE based noise PSD tracking with low complexity,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2010, pp. 4266–4269. [Google Scholar]
- [67].Hermansky H, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Amer, vol. 87, pp. 1738–1752, 1990. [DOI] [PubMed] [Google Scholar]
- [68].Hermansky H and Morgan N, “RASTA processing of speech,” IEEE Trans. Speech Audio Process, vol. 2, no. 4, pp. 578–589, Oct. 1994. [Google Scholar]
- [69].Hershey J, Chen Z, Le Roux J, and Watanabe S, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2016, pp. 31–35. [Google Scholar]
- [70].Hertz H, Krogh A, and Palmer RG, Introduction to the Theory of Neural Computation. Redwood City, CA, USA: Addison-Wesley, 1991. [Google Scholar]
- [71] Heymann J, Drude L, and Haeb-Umbach R, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 196–200.
- [72] Higuchi T, Ito N, Yoshioka T, and Nakatani T, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2016, pp. 5210–5214.
- [73] Hinton G et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
- [74] Hinton GE, Osindero S, and Teh Y-W, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, pp. 1527–1554, 2006.
- [75] Hochreiter S and Schmidhuber J, “Long short-term memory,” Neural Comput., vol. 9, pp. 1735–1780, 1997.
- [76] Hu G and Wang DL, “Speech segregation based on pitch tracking and amplitude modulation,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2001, pp. 79–82.
- [77] Hu G and Wang DL, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.
- [78] Hu G and Wang DL, “A tandem algorithm for pitch estimation and voiced speech segregation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2067–2079, Nov. 2010.
- [79] Hu K and Wang DL, “An unsupervised approach to cochannel speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 122–131, Jan. 2013.
- [80] Huang F and Lee T, “Pitch estimation in noisy speech using accumulated peak spectrum and sparse estimation technique,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 1, pp. 99–109, Jan. 2013.
- [81] Huang P-S, Kim M, Hasegawa-Johnson M, and Smaragdis P, “Deep learning for monaural speech separation,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2014, pp. 1581–1585.
- [82] Huang P-S, Kim M, Hasegawa-Johnson M, and Smaragdis P, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, pp. 2136–2147, Dec. 2015.
- [83] Hui L, Cai M, Guo C, He L, Zhang W-Q, and Liu J, “Convolutional maxout neural networks for speech separation,” in Proc. IEEE Int. Symp. Signal Process. Inf. Technol., 2015, pp. 24–27.
- [84] Hummersone C, Stokes T, and Brooks T, “On the ideal ratio mask as the goal of computational auditory scene analysis,” in Blind Source Separation, Naik GR and Wang W, Eds. Berlin, Germany: Springer, 2014, pp. 349–368.
- [85] Hyvärinen A and Oja E, “Independent component analysis: Algorithms and applications,” Neural Netw., vol. 13, pp. 411–430, 2000.
- [86] Ikbal S, Misra H, and Bourlard HA, “Phase autocorrelation (PAC) derived robust speech features,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2003, pp. II-133–II-136.
- [87] Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, ITU-T Recommendation P.862, 2000.
- [88] Jarrett DP, Habets E, and Naylor PA, Theory and Applications of Spherical Microphone Array Processing. Zurich, Switzerland: Springer, 2016.
- [89] Jensen J and Taal CH, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, Nov. 2016.
- [90] Jiang Y, Wang DL, Liu RS, and Feng ZM, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2112–2121, Dec. 2014.
- [91] Jin Z and Wang DL, “A supervised learning approach to monaural segregation of reverberant speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 625–638, May 2009.
- [92] Johnstone PM and Litovsky RY, “Effect of masker type and age on speech intelligibility and spatial release from masking in children and adults,” J. Acoust. Soc. Amer., vol. 120, pp. 2177–2189, 2006.
- [93] Kidd G et al., “Determining the energetic and informational components of speech-on-speech masking,” J. Acoust. Soc. Amer., vol. 140, pp. 132–144, 2016.
- [94] Kim C and Stern RM, “Nonlinear enhancement of onset for robust speech recognition,” in Proc. INTERSPEECH, 2010, pp. 2058–2061.
- [95] Kim C and Stern RM, “Power-normalized cepstral coefficients (PNCC) for robust speech recognition,” in Proc. Int. Conf. Acoust., Speech Signal Process., 2012, pp. 4101–4104.
- [96] Kim D, Lee SH, and Kil RM, “Auditory processing of speech signals for robust speech recognition in real-world noisy environments,” IEEE Trans. Speech Audio Process., vol. 7, no. 1, pp. 55–69, Jan. 1999.
- [97] Kim G, Lu Y, Hu Y, and Loizou PC, “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” J. Acoust. Soc. Amer., vol. 126, pp. 1486–1494, 2009.
- [98] Kim M and Smaragdis P, “Adaptive denoising autoencoders: A fine-tuning scheme to learn from test mixtures,” in Proc. Int. Conf. Latent Var. Anal. Signal Separation, 2015, pp. 100–107.
- [99] Kjems U, Boldt JB, Pedersen MS, Lunner T, and Wang DL, “Role of mask pattern in intelligibility of ideal binary-masked noisy speech,” J. Acoust. Soc. Amer., vol. 126, pp. 1415–1426, 2009.
- [100] Kolbak M, Tan ZH, and Jensen J, “Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, Jan. 2017.
- [101] Kolbak M, Yu D, Tan Z-H, and Jensen J, “Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, Oct. 2017.
- [102] Kressner AA, May T, and Rozell CJ, “Outcome measures based on classification performance fail to predict the intelligibility of binary-masked speech,” J. Acoust. Soc. Amer., vol. 139, pp. 3033–3036, 2016.
- [103] Kulmer J and Mowlaee P, “Phase estimation in single channel speech enhancement using phase decomposition,” IEEE Signal Process. Lett., vol. 22, no. 5, pp. 598–602, May 2015.
- [104] Kumar K, Kim C, and Stern RM, “Delta-spectral cepstral coefficients for robust speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011, pp. 4784–4787.
- [105] Le Roux J, Hershey J, and Weninger F, “Deep NMF for speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 66–70.
- [106] LeCun Y et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, pp. 541–551, 1989.
- [107] Lee Y-S, Yang C-Y, Wang S-F, Wang J-C, and Wu C-H, “Fully complex deep neural network for phase-incorporating monaural source separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 281–285.
- [108] Li B, Sainath TN, Weiss RJ, Wilson KW, and Bacchiani M, “Neural network adaptive beamforming for robust multichannel speech recognition,” in Proc. INTERSPEECH, 2016, pp. 1976–1980.
- [109] Li N and Loizou PC, “Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction,” J. Acoust. Soc. Amer., vol. 123, pp. 1673–1682, 2008.
- [110] Liang S, Liu W, Jiang W, and Xue W, “The optimal ratio time-frequency mask for speech separation in terms of the signal-to-noise ratio,” J. Acoust. Soc. Amer., vol. 134, pp. 452–458, 2013.
- [111] Licklider JCR, “A duplex theory of pitch perception,” Experientia, vol. 7, pp. 128–134, 1951.
- [112] Lightburn L and Brookes M, “SOBM - a binary mask for noisy speech that optimises an objective intelligibility metric,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 661–665.
- [113] Loizou PC, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, 2013.
- [114] Loizou PC and Kim G, “Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 47–56, Jan. 2011.
- [115] Lu X, Matsuda S, Hori C, and Kashioka H, “Speech restoration based on deep learning autoencoder with layer-wised pretraining,” in Proc. INTERSPEECH, 2012, pp. 1504–1507.
- [116] Lu X, Tsao Y, Matsuda S, and Hori C, “Speech enhancement based on deep denoising autoencoder,” in Proc. INTERSPEECH, 2013, pp. 555–559.
- [117] Lyon RF, “A computational model of binaural localization and separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 1983, pp. 1148–1151.
- [118] Lyon RF, Human and Machine Hearing. New York, NY, USA: Cambridge Univ. Press, 2017.
- [119] Maganti HK and Matassoni M, “An auditory based modulation spectral feature for reverberant speech recognition,” in Proc. INTERSPEECH, 2010, pp. 570–573.
- [120] Masutomi K, Barascud N, Kashino M, McDermott JH, and Chait M, “Sound segregation via embedded repetition is robust to inattention,” J. Exp. Psychol.: Human Perception Perform., vol. 42, pp. 386–400, 2016.
- [121] Meng Z, Watanabe S, Hershey J, and Erdogan H, “Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 271–275.
- [122] Michelsanti D and Tan Z-H, “Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification,” in Proc. INTERSPEECH, 2017, pp. 2008–2012.
- [123] Miller GA, “The masking of speech,” Psychol. Bull., vol. 44, pp. 105–129, 1947.
- [124] Miller GA and Heise GA, “The trill threshold,” J. Acoust. Soc. Amer., vol. 22, pp. 637–638, 1950.
- [125] Moghimi AR and Stern RM, “An analysis of binaural spectro-temporal masking as nonlinear beamforming,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 835–839.
- [126] Moore BCJ, An Introduction to the Psychology of Hearing, 5th ed. San Diego, CA, USA: Academic, 2003.
- [127] Moore BCJ, Cochlear Hearing Loss, 2nd ed. Chichester, U.K.: Wiley, 2007.
- [128] Nair V and Hinton GE, “Rectified linear units improve restricted Boltzmann machines,” in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 807–814.
- [129] Nakatani T, Ito M, Higuchi T, Araki S, and Kinoshita K, “Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 286–290.
- [130] Narayanan A and Wang DL, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2013, pp. 7092–7096.
- [131] Naylor PA and Gaubitch ND, Eds., Speech Dereverberation. London, U.K.: Springer, 2010.
- [132] Nie S, Zhang H, Zhang X, and Liu W, “Deep stacking networks with time series for speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 6717–6721.
- [133] Nugraha AA, Liutkus A, and Vincent E, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1652–1664, Sep. 2016.
- [134] Papadopoulos P, Tsiartas A, and Narayanan S, “Long-term SNR estimation of speech signals in known and unknown channel conditions,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 12, pp. 2495–2506, Dec. 2016.
- [135] Parbery-Clark A, Skoe E, Lam C, and Kraus N, “Musician enhancement for speech-in-noise,” Ear Hearing, vol. 30, pp. 653–661, 2009.
- [136] Park SR and Lee JW, “A fully convolutional neural network for speech enhancement,” in Proc. INTERSPEECH, 2016, pp. 1993–1997.
- [137] Pascanu R, Mikolov T, and Bengio Y, “On the difficulty of training recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 1310–1318.
- [138] Pascual S, Bonafonte A, and Serra J, “SEGAN: Speech enhancement generative adversarial network,” in Proc. INTERSPEECH, 2017, pp. 3642–3646.
- [139] Pfeifenberger L, Zohrer M, and Pernkopf F, “DNN-based speech mask estimation for eigenvector beamforming,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 66–70.
- [140] Rix A, Beerends J, Hollier M, and Hekstra A, “Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2001, pp. 749–752.
- [141] Roman N, Wang DL, and Brown GJ, “Speech segregation based on sound localization,” J. Acoust. Soc. Amer., vol. 114, pp. 2236–2252, 2003.
- [142] Rosenblatt F, Principles of Neurodynamics. New York, NY, USA: Spartan, 1962.
- [143] Rumelhart DE, Hinton GE, and Williams RJ, “Learning internal representations by error propagation,” in Parallel Distributed Processing, Rumelhart DE and McClelland JL, Eds., Cambridge, MA, USA: MIT Press, 1986, pp. 318–362.
- [144] Russell S and Norvig P, Artificial Intelligence: A Modern Approach, 3rd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2010.
- [145] Schadler MR, Meyer BT, and Kollmeier B, “Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition,” J. Acoust. Soc. Amer., vol. 131, pp. 4134–4151, 2012.
- [146] Schmidhuber J, “Deep learning in neural networks: An overview,” Neural Netw., vol. 61, pp. 85–117, 2015.
- [147] Seltzer ML, Raj B, and Stern RM, “A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition,” Speech Commun., vol. 43, pp. 379–393, 2004.
- [148] Shamma S, Elhilali M, and Micheyl C, “Temporal coherence and attention in auditory scene analysis,” Trends Neurosci., vol. 34, pp. 114–123, 2011.
- [149] Shannon BJ and Paliwal KK, “Feature extraction from higher-order autocorrelation coefficients for robust speech recognition,” Speech Commun., vol. 48, pp. 1458–1485, 2006.
- [150] Shao Y, Srinivasan S, and Wang DL, “Robust speaker identification using auditory features and computational auditory scene analysis,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2008, pp. 1589–1592.
- [151] Shinn-Cunningham B, “Object-based auditory and visual attention,” Trends Cogn. Sci., vol. 12, pp. 182–186, 2008.
- [152] Srinivasan S, Roman N, and Wang DL, “Binary and ratio time-frequency masks for robust speech recognition,” Speech Commun., vol. 48, pp. 1486–1501, 2006.
- [153] Srivastava RK, Greff K, and Schmidhuber J, “Highway networks,” arXiv:1505.00387, 2015.
- [154] Summers V and Leek MR, “F0 processing and the separation of competing speech signals by listeners with normal hearing and with hearing loss,” J. Speech, Lang., Hearing Res., vol. 41, pp. 1294–1306, 1998.
- [155] Sun L, Du J, Dai L-R, and Lee C-H, “Multiple-target deep learning for LSTM-RNN based speech enhancement,” in Proc. Workshop Hands-free Speech Commun. Microphone Arrays, 2017, pp. 136–140.
- [156] Sundermeyer M, Ney H, and Schluter R, “From feedforward to recurrent LSTM neural networks for language modeling,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 3, pp. 517–529, Mar. 2015.
- [157] Sutskever I, Vinyals O, and Le QV, “Sequence to sequence learning with neural networks,” in Proc. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
- [158] Taal CH, Hendriks RC, Heusdens R, and Jensen J, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, Sep. 2011.
- [159] Tabri D, Chacra KM, and Pring T, “Speech perception in noise by monolingual, bilingual and trilingual listeners,” Int. J. Lang. Commun. Disorders, vol. 46, pp. 411–422, 2011.
- [160] Tamura S and Waibel A, “Noise reduction using connectionist models,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 1988, pp. 553–556.
- [161] Tu M and Zhang X, “Speech enhancement based on deep neural networks with skip connections,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 5565–5569.
- [162] Tu Y, Du J, Xu Y, Dai L-R, and Lee C-H, “Speech separation based on improved deep neural networks with dual outputs of speech features for both target and interfering speakers,” in Proc. 9th Int. Symp. Chinese Spoken Lang. Process., 2014, pp. 250–254.
- [163] van Noorden LPAS, “Temporal coherence in the perception of tone sequences,” Ph.D. dissertation, Eindhoven University of Technology, Eindhoven, The Netherlands, 1975.
- [164] van Veen BD and Buckley KM, “Beamforming: A versatile approach to spatial filtering,” IEEE ASSP Mag., vol. 5, no. 2, pp. 4–24, Apr. 1988.
- [165] Vincent E, Gribonval R, and Fevotte C, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
- [166] Virtanen T, Gemmeke JF, and Raj B, “Active-set Newton algorithm for overcomplete non-negative representations of audio,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 21, no. 11, pp. 2277–2289, Nov. 2013.
- [167] Vu TT, Bigot B, and Chng ES, “Combining non-negative matrix factorization and deep neural networks for speech enhancement and automatic speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 499–503.
- [168] Wang DL, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, Divenyi P, Ed., Norwell, MA, USA: Kluwer Academic, 2005, pp. 181–197.
- [169] Wang DL, “The time dimension for scene analysis,” IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1401–1426, Nov. 2005.
- [170] Wang DL, “Time-frequency masking for speech separation and its potential for hearing aid design,” Trends Amplif., vol. 12, pp. 332–353, 2008.
- [171] Wang DL, “Deep learning reinvents the hearing aid,” IEEE Spectrum, pp. 32–37, Mar. 2017.
- [172] Wang DL and Brown GJ, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ, USA: Wiley, 2006.
- [173] Wang DL, Kjems U, Pedersen MS, Boldt JB, and Lunner T, “Speech intelligibility in background noise with ideal binary time-frequency masking,” J. Acoust. Soc. Amer., vol. 125, pp. 2336–2347, 2009.
- [174] Wang Y, Du J, Dai L-R, and Lee C-H, “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 7, pp. 1535–1546, Jul. 2017.
- [175] Wang Y, Du J, Dai L-R, and Lee C-H, “A maximum likelihood approach to deep neural network based nonlinear spectral mapping for single-channel speech separation,” in Proc. INTERSPEECH, 2017, pp. 1178–1182.
- [176] Wang Y, Getreuer P, Hughes T, Lyon RF, and Saurous RA, “Trainable frontend for robust and far-field keyword spotting,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 5670–5674.
- [177] Wang Y, Han K, and Wang DL, “Exploring monaural features for classification-based speech segregation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 270–279, Feb. 2013.
- [178] Wang Y, Narayanan A, and Wang DL, “On training targets for supervised speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, Dec. 2014.
- [179] Wang Y and Wang DL, “Boosting classification based speech separation using temporal dynamics,” in Proc. INTERSPEECH, 2012, pp. 1528–1531.
- [180] Wang Y and Wang DL, “Cocktail party processing via structured prediction,” in Proc. Neural Inf. Process. Syst., 2012, pp. 224–232.
- [181] Wang Y and Wang DL, “Towards scaling up classification-based speech separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Jul. 2013.
- [182] Wang Y and Wang DL, “A deep neural network for time-domain signal reconstruction,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 4390–4394.
- [183] Wang Z-Q and Wang DL, “Phoneme-specific speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 146–150.
- [184] Wang Z, Wang X, Li X, Fu Q, and Yan Y, “Oracle performance investigation of the ideal masks,” in Proc. Int. Workshop Acoust. Echo Noise Control, 2016, pp. 1–5.
- [185] Weninger F et al., “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Proc. 12th Int. Conf. Latent Var. Anal. Signal Separation, 2015, pp. 91–99.
- [186] Weninger F, Hershey J, Le Roux J, and Schuller B, “Discriminatively trained recurrent neural networks for single-channel speech separation,” in Proc. IEEE Global Conf. Signal Inf. Process., 2014, pp. 740–744.
- [187] Werbos PJ, “Backpropagation through time: What it does and how to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.
- [188] Williamson DS, Wang Y, and Wang DL, “Complex ratio masking for monaural speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 3, pp. 483–492, Mar. 2016.
- [189] Wolpert DH, “The lack of a priori distinction between learning algorithms,” Neural Comput., vol. 8, pp. 1341–1390, 1996.
- [190] Wu B, Li K, Yang M, and Lee C-H, “A reverberation-time-aware approach to speech dereverberation based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 102–111, Jan. 2017.
- [191] Wu M and Wang DL, “A two-stage algorithm for one-microphone reverberant speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 774–784, May 2006.
- [192] Xia S, Li H, and Zhang X, “Using optimal ratio mask as training target for supervised speech separation,” in Proc. Asia Pac. Signal Inf. Process. Assoc., 2017.
- [193] Xiao X, Zhao S, Jones DL, Chng ES, and Li H, “On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 3246–3250.
- [194] Xiao X et al., “Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation,” EURASIP J. Adv. Signal Process., vol. 2016, pp. 1–18, 2016.
- [195] Xu Y, Du J, Dai L-R, and Lee C-H, “Dynamic noise aware training for speech enhancement based on deep neural networks,” in Proc. INTERSPEECH, 2014, pp. 2670–2674.
- [196] Xu Y, Du J, Dai L-R, and Lee C-H, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Process. Lett., vol. 21, no. 1, pp. 65–68, Jan. 2014.
- [197] Xu Y, Du J, Dai L-R, and Lee C-H, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 7–19, Jan. 2015.
- [198] Xu Y, Du J, Huang Z, Dai L-R, and Lee C-H, “Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement,” in Proc. INTERSPEECH, 2015, pp. 1508–1512.
- [199] Yilmaz O and Rickard S, “Blind separation of speech mixtures via time-frequency masking,” IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830–1847, Jul. 2004.
- [200] Yoshioka T et al., “The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices,” in Proc. IEEE Workshop Automat. Speech Recognit. Understanding, 2015, pp. 436–443.
- [201] Yost WA, “The cocktail party problem: Forty years later,” in Binaural and Spatial Hearing in Real and Virtual Environments, Gilkey RH and Anderson TR, Eds., Mahwah, NJ, USA: Lawrence Erlbaum, 1997, pp. 329–347.
- [202] Yu D, Kolbak M, Tan Z-H, and Jensen J, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 241–245.
- [203] Yu Y, Wang W, and Han P, “Localization based stereo speech source separation using probabilistic time-frequency masking and deep neural networks,” EURASIP J. Audio Speech Music Process., vol. 2016, pp. 1–18, 2016.
- [204] Yuo K-H and Wang H-C, “Robust features for noisy speech recognition based on temporal trajectory filtering of short-time autocorrelation sequences,” Speech Commun., vol. 28, pp. 13–24, 1999.
- [205] Zhang H, Zhang X, and Gao G, “Multi-target ensemble learning for monaural speech separation,” in Proc. INTERSPEECH, 2017, pp. 1958–1962.
- [206] Zhang X-L and Wang DL, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 5, pp. 967–977, May 2016.
- [207] Zhang X-L and Wu J, “Deep belief networks based voice activity detection,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 4, pp. 697–710, Apr. 2013.
- [208] Zhang X and Wang DL, “Deep learning based binaural speech separation in reverberant environments,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 5, pp. 1075–1084, May 2017.
- [209] Zhang X, Wang Z-Q, and Wang DL, “A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 276–280.
- [210] Zhang X, Zhang H, Nie S, Gao G, and Liu W, “A pairwise algorithm using the deep stacking network for speech separation and pitch estimation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 6, pp. 1066–1078, Jun. 2016.
- [211] Zhao Y, Wang Z-Q, and Wang DL, “A two-stage algorithm for noisy and reverberant speech enhancement,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 5580–5584.