Author manuscript; available in PMC: 2010 Apr 27.
Published in final edited form as: IEEE Trans Audio Speech Lang Process. 2009 Jan 1;17(1):24–37. doi: 10.1109/TASL.2008.2005342

Speech Enhancement, Gain, and Noise Spectrum Adaptation Using Approximate Bayesian Estimation

Jiucang Hao 1, Hagai Attias 2, Srikantan Nagarajan 3, Te-Won Lee 4, Terrence J Sejnowski 5
PMCID: PMC2860321  NIHMSID: NIHMS192814  PMID: 20428253

Abstract

This paper presents a new approximate Bayesian estimator for enhancing a noisy speech signal. The speech model is assumed to be a Gaussian mixture model (GMM) in the log-spectral domain, in contrast to most current models, which are formulated in the frequency domain. Exact signal estimation is computationally intractable, so we derive three approximations to make signal estimation efficient. The Gaussian approximation transforms the log-spectral domain GMM into the frequency domain using a minimal Kullback–Leibler (KL) divergence criterion. The frequency-domain Laplace method computes the maximum a posteriori (MAP) estimator for the spectral amplitude. Correspondingly, the log-spectral domain Laplace method computes the MAP estimator for the log-spectral amplitude. Further, gain and noise spectrum adaptation are implemented using the expectation–maximization (EM) algorithm within the GMM under the Gaussian approximation. The proposed algorithms are evaluated by applying them to enhance speech corrupted by speech-shaped noise (SSN). The experimental results demonstrate that the proposed algorithms offer improved signal-to-noise ratio, lower word recognition error rate, and less spectral distortion.

Index Terms: Approximate Bayesian estimation, Gaussian mixture model (GMM), speech enhancement

I. INTRODUCTION

In real-world environments, speech signals are usually corrupted by adverse noise, such as competing speakers, background noise, or car noise, and they are also subject to distortion caused by communication channels; examples are room reverberation, low-quality microphones, etc. Except in specialized studios or laboratories, whenever an audio signal is recorded, noise is recorded as well. In some circumstances, such as cars in traffic, the noise level can exceed that of the speech signal. Speech enhancement improves signal quality by suppressing noise and reducing distortion. It has many applications, for example, mobile communications, robust speech recognition, low-quality audio devices, and hearing aids.

Because of its broad application range, speech enhancement has attracted intensive research for many years. The difficulty arises from the fact that precise models for both the speech signal and the noise are unknown [1]; thus, the speech enhancement problem remains unsolved [2]. A wide variety of models and speech enhancement algorithms have been developed, which can be broadly classified into two categories: single-microphone and multi-microphone. While the second class is potentially more powerful because it has multiple microphone inputs, it also involves complicated joint modeling of the microphones, such as beamforming [2]–[4]. Algorithms based on a single microphone have been a major research focus, and a popular subclass is spectral domain algorithms.

It is believed that, for speech quality, the spectral magnitude is more important than the phase. Boll proposed the spectral subtraction method [5], where the signal spectrum is estimated by subtracting the noise spectrum from the noisy signal spectrum. When the noisy signal spectrum falls below the noise level, the method produces negative values, which need to be clamped to zero or replaced by a small value. Alternatively, signal subspace methods [6] aim to find a desired signal subspace that is disjoint from the noise subspace. Thus, the components that lie in the complementary noise subspace can be removed. A more general task is source separation. Ideally, if there exists a domain where the subspaces of different signal sources are disjoint, then perfect signal separation can be achieved by projecting each source signal onto its subspace [7]. This method can also be applied to the single-channel source separation problem, where the target speaker is considered signal and the competing speaker is considered noise. Other approaches include algorithms based on audio coding [8], independent component analysis (ICA) [9], and perceptual models [10].

The performance of speech enhancement is commonly evaluated using distortion measures. Enhanced signals can therefore be estimated by minimizing the expected distortion, where the expectation accounts for the stochastic nature of the speech signal. Thus, statistical-model-based speech enhancement systems [11] have been particularly successful. Statistical approaches require prespecified parametric models for both the signal and the noise. The model parameters are obtained by maximizing the likelihood of training samples of clean signals using the expectation–maximization (EM) algorithm. Because the true model for speech remains unknown [1], a variety of statistical models have been proposed. The short-time spectral amplitude (STSA) estimator [12] and the log-spectral amplitude estimator (LSAE) [13] assume that the spectral coefficients of both signal and noise obey Gaussian distributions. Their difference is that STSA is the minimum mean square error (MMSE) estimator of the spectral amplitude, while LSAE is the MMSE estimator of the log-spectrum. LSAE is considered more appropriate because the log-spectrum is believed to be more suitable for speech processing. Hidden Markov models (HMMs) have also been developed for clean speech; an HMM with gain adaptation has been applied to speech enhancement [14] and to the recognition of clean and noisy speech [15]. In contrast to the frequency-domain models [12]–[15], the density of log-spectral amplitudes can be modeled by a Gaussian mixture model (GMM) with parameters trained on clean signals [16]–[18]. Spectrally similar signals are clustered and represented by their mixture components. Though the quality of fitting the signal distribution using the GMM depends on the number of mixture components [19], the density of the speech log-spectral amplitudes can be accurately represented with a very small number of mixtures.
However, this approach leads to a complex model in the frequency domain, and exact signal estimation becomes intractable; therefore, approximation methods have been proposed. The MIXMAX algorithm [16] simplifies the mixing process such that the noisy signal takes the maximum of either the signal or the noise, which yields a closed-form signal estimator. Linear approximation [17], [18] expands the logarithm function locally using a Taylor expansion. This leads to a linear Gaussian model where estimation is easy, although finding the point of Taylor expansion requires iterative optimization. Spectral domain algorithms offer high-quality speech enhancement while remaining low in computational complexity.

In this paper, departing from the frequency-domain models [12]–[15], we start with a GMM in the log-spectral domain as proposed in [16]–[18]. Converting the GMM from the log-spectral domain into the frequency domain directly produces a mixture of log-normal distributions, which makes the signal estimation difficult to compute. Approximating the logarithm function [16]–[18] is accurate only locally on a limited interval and thus may not be optimal. We propose three methods based on Bayesian estimation. The first substitutes the log-normal distribution with the optimal Gaussian distribution in the Kullback–Leibler (KL) divergence [20] sense. In this way, we obtain a GMM in the frequency domain with a closed-form signal estimator. The second approach uses the Laplace method [21], where the spectral amplitude is estimated by computing the maximum a posteriori (MAP) estimate. The Laplace method approximates the posterior distribution by a Gaussian derived from the second-order Taylor expansion of the log-likelihood. The third approach is also based on the Laplace method, but the log-spectra of the signals are estimated using the MAP. The spectral amplitudes are obtained by exponentiating their log-spectra.

The statistical approaches discussed above rely on parameters estimated from training samples that reflect the statistical properties of the signal. However, the statistics of the test signals may not match those of the training signals perfectly. For example, movement of the speakers and changes of the recording conditions cause mismatches. Such difficulty can be overcome by introducing parameters that adapt to the environmental changes. Gain and noise adaptation partially solves this problem [14], [15]. Unlike the gain estimation in [12], [22], the gain here refers to the energy of the signal, corresponding to the volume of the audio. In [17], noise estimation is proposed, but the gain is fixed to 1. We propose an EM algorithm with efficient gain and noise estimation under the Gaussian approximation.

The paper is organized as follows. In Section II, the speech and noise models are introduced. In Section III, the proposed algorithms are derived in detail. In Section IV, an EM algorithm for learning the gain and the noise spectrum under the Gaussian approximation is presented. Section V shows the experimental results and comparisons to other methods applied to enhance speech corrupted by speech-shaped noise (SSN). Section VI concludes the paper.

Notations

We use X or x to denote variables derived from the clean signal, Y or y for variables derived from the noisy signal, and N or n for variables derived from the noise. The small letters with square brackets, x[t], y[t], and n[t], denote time-domain variables. The capital letters, Xk, Yk, and Nk, denote the fast Fourier transform (FFT) coefficients; the small letters, xk, yk, and nk, denote the log-spectral amplitudes; and the letters with superscript c, xkc, ykc, and nkc, denote the cepstral coefficients. The subindex k is the frequency bin index. H denotes the gain and H* denotes its complex conjugate. 𝒩(x | μ, ν) denotes the Gaussian distribution with mean μ and precision ν, which is defined as the inverse of the covariance, ν = 1/E{(x − μ)2}. The small letter s denotes the mixture component (state index). μks and Bks denote the mean and the precision of the distribution of the clean signal log-spectrum xk, and Γ = diag(γ1,…, γK) denotes the precision of the distribution of the noise FFT coefficients.

II. PRIOR SPEECH MODEL AND SIGNAL ESTIMATION

A. Signal Representations

Let x[t] be the time-domain signal. The FFT coefficients Xk are obtained by applying the FFT to the segmented and windowed signal x[t]. The log-spectral amplitude is computed as the logarithm of the squared magnitude of the FFT coefficients, xk = log(|Xk|2). The cepstral coefficients xkc are computed by taking the inverse FFT (IFFT) of the log-spectral amplitudes xk. Fig. 1 shows the relationship among the different domains. Note that for the FFT coefficients, the kth component Xk is the complex conjugate of XK−k. Thus, we only need to keep the first K/2 + 1 components, because the rest provide no additional information, and the IFFT shares the same symmetry. Due to this symmetry, the cepstral coefficients xkc are real.
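The chain of transforms above can be sketched in a few lines of NumPy. The frame below is a synthetic Hann-windowed example with an illustrative length K = 512; frame extraction and overlap-add are omitted, and nothing here is specific to the paper's data.

```python
import numpy as np

# Minimal sketch of the domain transforms of Fig. 1 on one synthetic
# Hann-windowed frame; the frame length K = 512 is illustrative.
K = 512
rng = np.random.default_rng(0)
frame = rng.standard_normal(K) * np.hanning(K)   # windowed time-domain frame x[t]

X = np.fft.fft(frame)                 # frequency domain: FFT coefficients X_k
x = np.log(np.abs(X) ** 2)            # log-spectral amplitudes x_k = log|X_k|^2
xc = np.fft.ifft(x)                   # cepstral coefficients x_k^c

# Conjugate symmetry X_k = conj(X_{K-k}) makes the cepstrum real, so only
# the first K/2 + 1 components carry information.
assert np.allclose(X[1:], np.conj(X[:0:-1]))
assert np.allclose(xc.imag, 0.0, atol=1e-8)
```

The two assertions check exactly the symmetry property stated in the text: the redundant half of the spectrum carries no new information, and the cepstral coefficients come out real.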

Fig. 1. Diagram of the relationship among the time domain, the frequency domain, the log-spectral domain, and the cepstral domain.

B. Speech and Noise Models

We consider the clean signal x[t] to be contaminated by statistically independent, zero-mean noise n[t] in the time domain. Under the assumption of additive noise, the observed signal can be described by

y[t] = h[t] * x[t] + n[t] = \sum_m h_m x[t-m] + n[t] (1)

where h[t] is the impulse response of the filter and * denotes convolution. Such a signal is often processed in the frequency domain by applying the FFT

Y_k = H_k X_k + N_k (2)

where k denotes the frequency bin and Hk is the gain. In this paper, we focus on a stationary channel, where Hk is time-independent.

Statistical models characterize signals by their probability density functions (pdfs). The GMM, given a sufficient number of mixture components, can approximate any density function to arbitrary accuracy when the parameters (weights, means, and covariances) are suitably chosen [19, p. 214]. The number of parameters of a GMM is usually small, and they can be reliably estimated using the EM algorithm [19]. Here, we assume the log-spectral amplitudes {x0,…,xK−1} obey a GMM

p(x) = \sum_s p(x|s) p(s) = \sum_s \prod_k \mathcal{N}(x_k | \mu_{ks}, B_{ks}) p(s) (3)

where s is the state of the mixture component. For state s, 𝒩(xk | μks, Bks) denotes a Gaussian with mean μks and precision Bks defined as the inverse of the covariance

\mathcal{N}(x_k | \mu_{ks}, B_{ks}) = \sqrt{\frac{B_{ks}}{2\pi}} \, e^{-\frac{B_{ks}}{2}(x_k - \mu_{ks})^2}. (4)

Though the frequency bins are statistically independent given state s, they are dependent overall because the marginal density p(x) does not factorize.
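As a concrete illustration of (3) and (4), the prior density can be evaluated as follows. The state count, means, and precisions are illustrative placeholders, not trained values.

```python
import numpy as np

# Sketch of the log-spectral GMM prior (3)-(4) with S states and K bins.
# mu, B, ps stand in for mu_ks, B_ks, p(s); all values are illustrative.
S, K = 4, 8
rng = np.random.default_rng(1)
mu = rng.standard_normal((S, K))      # means mu_ks
B = np.full((S, K), 2.0)              # precisions B_ks (inverse variances)
ps = np.full(S, 1.0 / S)              # state priors p(s)

def log_gauss(x, m, b):
    """Elementwise log N(x | m, b) with precision b, as in (4)."""
    return 0.5 * np.log(b / (2 * np.pi)) - 0.5 * b * (x - m) ** 2

def gmm_density(x):
    """p(x) = sum_s p(s) prod_k N(x_k | mu_ks, B_ks), as in (3)."""
    log_px_s = log_gauss(x[None, :], mu, B).sum(axis=1)   # log p(x | s)
    return float(np.sum(ps * np.exp(log_px_s)))

px = gmm_density(rng.standard_normal(K))
```

Given the state, the bins factorize (the inner product over k), but the marginal density is the state-weighted sum and does not, which is exactly the dependence noted above.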

Using the definition of the log-spectrum xk = log(|Xk|2), Xk can be written as Xk = X′k + iX″k, where X′k = exk/2 cos θk and X″k = exk/2 sin θk are its real and imaginary parts and θk is its phase. Assuming that the phase is uniformly distributed, p(θk) = 1/(2π), and that the pdf of xk is given by (4), we compute the pdf of the FFT coefficients as

p(X_k | s) = p(X'_k, X''_k | s) = \left| \frac{\partial(X'_k, X''_k)}{\partial(x_k, \theta_k)} \right|^{-1} p(x_k | s) p(\theta_k) = \frac{1}{\pi |X_k|^2} \mathcal{N}(\log(|X_k|^2) | \mu_{ks}, B_{ks}) = \frac{1}{\pi |X_k|^2} \sqrt{\frac{B_{ks}}{2\pi}} \, e^{-\frac{B_{ks}}{2}(\log(|X_k|^2) - \mu_{ks})^2} (5)

where the Jacobian is |\partial(X'_k, X''_k)/\partial(x_k, \theta_k)| = e^{x_k}/2 = |X_k|^2/2. We call this density log-normal, because the logarithm of the random variable obeys a normal distribution. The frequency-domain model is preferred over the log-spectral domain because the corruption dynamics in (2) are simple.

We consider a noise process independent of the signal and assume that the FFT coefficients obey a Gaussian distribution with zero mean and precision matrix Γ = diag(γ1,…,γK)

p(N) = p(Y | X) = \prod_k \mathcal{N}(Y_k - H_k X_k | 0, \gamma_k) = \prod_k \frac{\gamma_k}{\pi} e^{-\gamma_k |Y_k - H_k X_k|^2}. (6)

Note that this Gaussian density is for complex variables. The precisions γk satisfy γk = 1/E{|Yk − HkXk|2}. In contrast, (4) is a Gaussian density for the log-spectrum xk, which is a real random variable.

The parameters μks, Bks, and p(s) of the speech model in (3) are estimated from training samples using an EM algorithm; details can be found in [19]. The precision matrix Γ = diag(γ1,…,γK) of the noise model can be estimated from either pure noise or the noisy signals.

C. Signal Estimation

Under the assumption that the noise is independent of the signal, the full probabilistic model is

p(Y,X,s)=p(Y|X)p(X|s)p(s). (7)

The posterior distribution of the signal is a mixture over the states

p(X | Y) = \sum_s p(X | Y, s) p(s | Y). (8)

For example, the MMSE estimator of a signal is given by

\hat{X} = \sum_s \left( \int X \, p(X | Y, s) \, dX \right) p(s | Y) = \sum_s \hat{X}_s \, p(s | Y) (9)

where X̂s is the signal estimator for state s. This signal estimator makes intuitive sense. Each mixture component enhances the noisy signal separately. Because the hidden state is unknown, the MMSE estimator is the average of the individual estimators X̂s, weighted by the posterior probability p(s | Y). The block diagram is shown in Fig. 2.

Fig. 2. Block diagram for speech enhancement based on mixture models. Each mixture component enhances the signal separately. The signal estimator is computed as the summation of the individual estimators weighted by their posterior probabilities p(s | Y).

The MMSE estimator suggests a general signal estimation method for mixture models. First, an estimator based on each mixture state s is computed. Then the posterior state probability p(s | Y) is calculated to reflect the contribution from state s. Finally, the system output is the summation of the state estimators, weighted by the posterior state probabilities. However, such a straightforward scheme cannot be carried out directly for the model considered here: neither the individual estimator X̂s nor the posterior state probability p(s | Y) is easy to compute. The difficulty originates from the log-normal distributions for speech in the frequency domain. We propose approximations to compute both terms. Because we assume a diagonal precision matrix Bs in the GMM, X̂s can be estimated separately for each frequency bin k.

III. SIGNAL ESTIMATION BASED ON APPROXIMATE BAYESIAN ESTIMATION

Intractability often limits the application of sophisticated models, and a great amount of research has been devoted to developing accurate and efficient approximations [20], [21]. Although there are popular methods that have been applied successfully, the effectiveness of such approximations is often model dependent. As indicated in (9), two terms, X̂s and p(s | Y), are required. Three algorithms are derived to estimate both terms. One is based on a Gaussian approximation. The other two are based on Laplace methods in the frequency domain and the log-spectral domain.

A. Gaussian Approximation (Gaussian)

As shown in Section II-B, the mixture of log-normal distributions for FFT coefficients makes the signal estimation difficult. If we substitute the log-normal distribution p(X | s) in (5) by a Gaussian for each state s, the frequency domain model becomes a GMM, which is analytically tractable.

For each state s, we choose the optimal Gaussian that minimizes the KL divergence DKL [23]

q = \arg\min_q D_{KL}(p \| q) = \arg\min_q \int p(X) \log \frac{p(X)}{q(X)} \, dX (10)

where DKL is nonnegative and equals zero if and only if p equals q almost surely. Note that DKL is asymmetric in its arguments p and q; DKL(p‖q) is chosen because a closed-form solution for q exists.

It can be shown that the optimal Gaussian q that minimizes the KL divergence has mean and covariance equal to those of the conditional density for state s, p(Xk | s). The mean of p(Xk | s) is zero due to the assumption of a uniform phase distribution. The second-order moments are

\lambda_{ks} = \int |X_k|^2 \, p(X_k | s) \, dX_k = \exp[\mu_{ks} + 1/(2B_{ks})]. (11)

The Gaussian q(Xk | s) = 𝒩(Xk | 0, 1/λks) minimizes DKL.

Under the Gaussian approximation, we have converted the GMM in log-spectral domain into a GMM in frequency domain. We denote this converted GMM by q(X)

q(X) = \sum_s \prod_k q(X_k | s) p(s) = \sum_s \prod_k \mathcal{N}(X_k | 0, 1/\lambda_{ks}) p(s). (12)

This approach avoids the complication from the log-normal distribution and offers efficient signal enhancement.

Under the assumption of a Gaussian noise model in (6), the posterior distribution over X for state s is computed as

p(X_k | Y_k, s) = \frac{p(Y_k | X_k) q(X_k | s)}{p(Y_k | s)} = \mathcal{N}(X_k | \hat{X}_{ks}, \phi_{ks}). (13)

It is a Gaussian with precision ϕks and mean X̂ks given by

\phi_{ks} = \lambda_{ks}^{-1} + \gamma_k (14)
\hat{X}_{ks} = \frac{\gamma_k}{\phi_{ks}} Y_k (15)

where λks is the variance of the approximate speech prior and γk is the precision of the noise pdf. Note that we have used the approximate speech prior q(Xk | s) in (13). The individual signal estimator for each state s is given by (15).

The posterior state probability p(s | Y) is computed

p(s | Y) = \frac{p(Y | s) p(s)}{p(Y)} (16)

using the Bayes’ rule. Under the speech prior q(X | s) in (12), p(Y | s) is computed as

p(Y | s) = \prod_k \int p(Y_k | X_k) q(X_k | s) \, dX_k = \prod_k \mathcal{N}(Y_k | 0, \psi_{ks}) (17)

where the precision ψks is given by

\psi_{ks} = \frac{1}{\lambda_{ks} + 1/\gamma_k}. (18)

Using (9) and substituting X̂ks from (15) and p(s | Y) from (16), the signal estimator can be written as

\hat{X}_k = \sum_s \hat{X}_{ks} \, p(s | Y) = \left( \sum_s \frac{\gamma_k}{\phi_{ks}} p(s | Y) \right) Y_k. (19)

Each individual estimator resembles the power response of a Wiener filter and is a linear function of Y. Note that the state probability depends on Y; therefore, the signal estimator in (19) is a nonlinear function of Y. This is analogous to a time-varying Wiener filter in which the signal and noise powers are known or can be estimated from a short period of the signal, such as with a decision-directed estimation approach [12], [22]. Here, the temporal variation enters through the changes of the posterior state probability p(s | Y) over time.
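Equations (11) and (14)–(19) amount to a mixture-weighted Wiener filter; a minimal sketch for one frame follows. The parameters standing in for μks, Bks, p(s), and γk are illustrative, not trained.

```python
import numpy as np

# Sketch of the Gaussian-approximation estimator, (11) and (14)-(19), for
# one frame. All parameter values are illustrative, not trained.
S, K = 4, 8
rng = np.random.default_rng(2)
mu = rng.standard_normal((S, K))                 # prior means mu_ks
B = np.full((S, K), 2.0)                         # prior precisions B_ks
ps = np.full(S, 1.0 / S)                         # state priors p(s)
gamma = np.full(K, 4.0)                          # noise precisions gamma_k
Y = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # noisy FFT frame

lam = np.exp(mu + 1.0 / (2.0 * B))               # (11): prior variances lambda_ks
phi = 1.0 / lam + gamma                          # (14): posterior precisions phi_ks
Xhat_s = (gamma / phi) * Y[None, :]              # (15): per-state Wiener estimates

# (16)-(18): posterior state probabilities from the complex Gaussian p(Y|s)
psi = 1.0 / (lam + 1.0 / gamma)                  # (18)
log_pYs = np.sum(np.log(psi / np.pi) - psi * np.abs(Y[None, :]) ** 2, axis=1)
w = ps * np.exp(log_pYs - log_pYs.max())         # unnormalized p(s|Y)
w /= w.sum()

Xhat = np.sum(w[:, None] * Xhat_s, axis=0)       # (19): weighted output
```

Since each per-state gain γk/ϕks lies in (0, 1), the combined estimator always attenuates the noisy coefficients, with the attenuation pattern selected by p(s | Y).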

B. Laplace Method in Frequency Domain (LaplaceFFT)

The Laplace method approximates a complicated distribution by a Gaussian around its MAP. It suggests using the MAP estimator of the original distribution, which is equivalent to the more popular MMSE estimator of the resulting Gaussian. Computing the MAP can be treated as an optimization problem, and many optimization tools can be applied; we use Newton's method to find the MAP. The Laplace method can also be applied to compute the posterior state probability, which requires an integration over the hidden variable X: it expands the logarithm of the integrand around its mode using a Taylor series expansion, transforming the integral into a Gaussian integral with a closed-form solution. However, this way of computing the posterior state probability is not accurate for our problem, and we use an alternative approach. The final signal estimator is constructed using (9).
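Before specializing to this model, the Laplace method itself can be illustrated on a one-dimensional integral. The function f below is an arbitrary example chosen for this sketch, not the paper's (29).

```python
import numpy as np

# One-dimensional illustration of the Laplace approximation:
# int exp(-f(x)) dx ~= exp(-f(x0)) * sqrt(2*pi / f''(x0)), where x0
# minimizes f. Here f(x) = x^2/2 + 0.05 x^4 is an illustrative example.
def f(x):
    return 0.5 * x ** 2 + 0.05 * x ** 4

x = np.linspace(-6.0, 6.0, 40001)
exact = np.sum(np.exp(-f(x))) * (x[1] - x[0])   # numerical reference

x0 = 0.0          # minimum of f
f2 = 1.0          # f''(x) = 1 + 0.6 x^2, so f''(0) = 1
laplace = np.exp(-f(x0)) * np.sqrt(2 * np.pi / f2)

# The quartic term narrows the true integrand relative to the fitted
# Gaussian, so the Laplace value slightly overestimates the integral.
```

The approximation replaces the integrand by the Gaussian matching its peak and curvature; its accuracy depends on how Gaussian-like the integrand is around the mode, which is the weakness noted later for the single-sample case.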

We derive the MAP estimator X̂ks for each state s. The logarithm of the posterior signal pdf, conditioned on state s, is given by

\log p(X_k | Y_k, s) = \log p(Y_k | X_k, s) + \log p(X_k | s) + c = -\gamma_k |Y_k - X_k|^2 + \log \frac{1}{\pi |X_k|^2} - \frac{B_{ks}}{2} (\log |X_k|^2 - \mu_{ks})^2 + c (20)

where c is a constant independent of Xk. It is more convenient to represent Xk by its magnitude rk and phase θk, Xk = rk e^{iθk}, and we compute the MAP estimators of the magnitude rk and phase θk for each state s

(\hat{r}_{ks}, \hat{\theta}_{ks}) = \arg\max_{r_k, \theta_k} \{ \log p(r_k, \theta_k | Y_k, s) \} = \arg\max_{r_k, \theta_k} \{ \log [ r_k \, p(X_k | Y_k, s) ] \}. (21)

Using (20) and neglecting the constant c, maximizing (21) is equivalent to minimizing the function h1 defined by

h_1(r_k, \theta_k) = \gamma_k |Y_k - r_k e^{i\theta_k}|^2 + \frac{B_{ks}}{2} (\log(r_k^2) - \beta_{ks})^2 (22)

where βks = μks − 1/(2Bks). It is obvious from the above equation that the MAP estimator for θk is θ̂k = ∠Yk, which is independent of the state s, and the magnitude estimator r̂ks minimizes

h_1(r_k) = \gamma_k (r_{yk} - r_k)^2 + \frac{B_{ks}}{2} (\log(r_k^2) - \beta_{ks})^2 (23)

where ryk = |Yk|. The minimization over rk has no analytical solution, but it can be carried out with Newton's method. For this, we need the first- and second-order derivatives of h1(rk) with respect to rk

h_1'(r_k) = 2\gamma_k (r_k - r_{yk}) + \frac{2 B_{ks} (\log(r_k^2) - \beta_{ks})}{r_k} (24)
h_1''(r_k) = 2\gamma_k + \frac{4 B_{ks}}{r_k^2} - \frac{2 B_{ks} (\log(r_k^2) - \beta_{ks})}{r_k^2}. (25)

Then, the Newton’s method iterates

\hat{r}_{ks} \leftarrow \hat{r}_{ks} - \eta \frac{h_1'(\hat{r}_{ks})}{|h_1''(\hat{r}_{ks})|}. (26)

The absolute value of h1″ ensures that the update steps toward a minimum of h1; η = 1 is the learning rate.

Newton's method is sensitive to the initialization and may converge to a local minimum. The two squared terms in (23) indicate that the optimal estimator r̂ks is bounded between eβks/2 and ryk. We use both values to initialize r̂ks and select the one that produces the smaller h1(rk). Empirically, we observe that this scheme always finds the global minimum. Because the first term in (23) is quadratic, Newton's method converges to the optimal solution faster than methods such as gradient descent, in fewer than five iterations in our case.
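The iteration (24)–(26) with the two-point initialization can be sketched as follows; the scalar inputs stand in for γk, Bks, βks, and |Yk|, and their values are illustrative. The positivity guard is an added numerical safeguard, not part of the paper's description.

```python
import numpy as np

# Sketch of the per-state MAP amplitude search, (23)-(26), with the dual
# initialization at exp(beta/2) and ry. gamma, B, beta, ry stand in for
# gamma_k, B_ks, beta_ks, |Y_k|; values are illustrative.
def map_amplitude(ry, gamma, B, beta, eta=1.0, iters=5):
    def h1(r):
        return gamma * (ry - r) ** 2 + 0.5 * B * (np.log(r ** 2) - beta) ** 2

    def newton(r):
        for _ in range(iters):
            g = 2 * gamma * (r - ry) + 2 * B * (np.log(r ** 2) - beta) / r  # (24)
            h = (2 * gamma + 4 * B / r ** 2
                 - 2 * B * (np.log(r ** 2) - beta) / r ** 2)                # (25)
            r = r - eta * g / abs(h)     # (26): |h1''| keeps a descent step
            r = max(r, 1e-8)             # guard: amplitude must stay positive
        return r

    # The optimum lies between exp(beta/2) and ry; keep the better of the two.
    candidates = [newton(np.exp(beta / 2)), newton(ry)]
    return min(candidates, key=h1)

r_hat = map_amplitude(ry=1.5, gamma=4.0, B=2.0, beta=-0.5)
```

Running both initializations and keeping the lower-cost solution is exactly the dual-initialization scheme described above.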

Computing the posterior state probability p(s | Y) requires the knowledge of p(Yk | s). Marginalization over Xk gives

p(Y_k | s) = \int p(Y_k | X_k) \, p(X_k | s) \, dX_k. (27)

However, because of the log-normal distribution p(Xk | s) in (5), the integral has no closed-form solution. Either numerical methods or approximations are needed. Numerical integration is computationally expensive, making approximation the more efficient choice. We propose the following two approaches, based on the Laplace method and on the Gaussian approximation.

1) Evaluate p(s | Y) Using the Laplace Method

The Laplace method is widely used to approximate integrals over continuous variables in statistical models to facilitate probabilistic inference [21], such as computing higher order statistics. It expands the logarithm of the integrand up to second order, leading to a Gaussian integral which has a closed-form solution. We rewrite (27) as

p(Y_k | s) = \frac{\gamma_k}{\pi} \sqrt{\frac{B_{ks}}{2\pi}} \int \exp(-f(X_k) - \beta_{ks}) \, dX_k (28)

where we define

f(X_k) = \gamma_k |Y_k - X_k|^2 + \frac{B_{ks}}{2} (\log(|X_k|^2) - \alpha_{ks})^2 (29)

and αks = μks − 1/Bks, βks = μks − 1/(2Bks). The Laplace method expands the exponent f(Xk) around its minimum X̂ks up to second order and carries out a Gaussian integration

\int e^{-f(X_k)} \, dX_k \approx e^{-f(\hat{X}_{ks})} \sqrt{\det(2\pi J^{-1})} (30)

where J is the Hessian of f(Xk) evaluated at X̂ks. Denote by X̂′ks and X̂″ks the real and imaginary parts of X̂ks = X̂′ks + iX̂″ks, and by r̂ks = |X̂ks| its magnitude. J is computed as

J = \begin{pmatrix} \partial^2 f / \partial X_k'^2 & \partial^2 f / \partial X_k' \partial X_k'' \\ \partial^2 f / \partial X_k'' \partial X_k' & \partial^2 f / \partial X_k''^2 \end{pmatrix} (31)
= \begin{pmatrix} a_k + 4 \hat{X}_{ks}'^2 b_k / \hat{r}_{ks}^2 & 4 \hat{X}_{ks}' \hat{X}_{ks}'' b_k / \hat{r}_{ks}^2 \\ 4 \hat{X}_{ks}' \hat{X}_{ks}'' b_k / \hat{r}_{ks}^2 & a_k + 4 \hat{X}_{ks}''^2 b_k / \hat{r}_{ks}^2 \end{pmatrix}. (32)

The ak and bk here are defined as

a_k = 2\gamma_k + \frac{2 B_{ks} (\log(\hat{r}_{ks}^2) - \alpha_{ks})}{\hat{r}_{ks}^2} (33)
b_k = \frac{B_{ks} - B_{ks} (\log(\hat{r}_{ks}^2) - \alpha_{ks})}{\hat{r}_{ks}^2}. (34)

The determinant of Hessian J is

\det(J) = a_k^2 + 4 a_k b_k. (35)

Thus, the marginal probability is

p(Y_k | s) \propto \sqrt{B_{ks}} \, e^{-\beta_{ks}} \, e^{-f(\hat{X}_{ks})} \frac{1}{\sqrt{\det(J)}}. (36)

This gives p(s | Y)

p(s | Y) = \frac{p(Y | s) p(s)}{p(Y)} \propto \prod_k p(Y_k | s) \, p(s). (37)

The Laplace method in essence approximates the posterior p(Xk | Yk, s) by a Gaussian density. This is very effective in Bayesian networks, where the training set includes a large number of samples and the posterior distribution of the (hyper)parameters has a peaky shape that closely resembles a Gaussian; the Laplace method then has an error that scales as O(1/T), where T is the number of samples [21]. However, the estimation here is based on a single sample Y. Further, the normalization factor of p(Yk | s) in (36) depends on the state s but is ignored. Thus, this approach does not yield good experimental results, and we derive another method.

2) Evaluate p(s | Y) Using Gaussian Approximation

As discussed in Section III-A, the log-normal distribution p(Xk | s) has a Gaussian approximation q(Xk | s) = 𝒩(Xk | 0, 1/λks) given in (12). Thus, we can compute the marginal distribution p(Yk | s) for state s as

p(Y_k | s) = \int p(Y_k | X_k) \, p(X_k | s) \, dX_k \approx \int p(Y_k | X_k) \, q(X_k | s) \, dX_k = \mathcal{N}(Y_k | 0, \psi_{ks}) (38)

where the precision ψks is given in (18). The posterior state probability p(s | Y) is obtained using the Bayes’ rule. It is

p(s | Y) = \frac{\prod_k p(Y_k | s) \, p(s)}{p(Y)}. (39)

This approach uses the same procedure shown in Section III-A.

The signal estimator is the summation of the MAP estimators r̂ks e^{i∠Yk} for each state s, weighted by the posterior state probabilities p(s | Y) in (39)

\hat{X}_k = \sum_s \hat{r}_{ks} \, e^{i \angle Y_k} \, p(s | Y). (40)

The MAP estimator for phase, ∠Yk, is utilized.

C. Laplace Method in Log-Spectral Domain (LaplaceLS)

It is suggested that the human auditory system perceives signals on a logarithmic scale; therefore, log-spectral analysis such as LSAE [13] is more suitable for speech processing. Thus, we can expect better performance if the log-spectra are estimated directly. The idea is to find the log-amplitude υ̂k = log(|Xk|2) that maximizes the log posterior probability log(p(Xk | Yk, s)) given in (20). Note that υ̂k is not the MAP of p(log(|Xk|2) | Yk, s). A similar situation arises in LSAE [13], where the expectation of the log-spectral error is taken over p(X) rather than p(log |X|). Optimization over υk also has the advantage of avoiding negative amplitudes due to local minima.

Substituting υk = log(|Xk|2) into (20), we compute the MAP estimator for the phase and log-amplitude υk. Note that the optimal phase is that of the noisy signal, θ̂k = ∠Yk. The MAP estimator for the log-amplitude maximizes (20), which is equivalent to minimizing

h_2(\upsilon_k) = \gamma_k (r_{yk} - e^{\upsilon_k/2})^2 + \upsilon_k + \frac{B_{ks}}{2} (\upsilon_k - \mu_{ks})^2 (41)

where ryk = |Yk|, and h2 can be minimized using Newton’s method. The first- and second-order derivatives are given by

h_2'(\upsilon_k) = -\gamma_k (r_{yk} - e^{\upsilon_k/2}) e^{\upsilon_k/2} + 1 + B_{ks} (\upsilon_k - \mu_{ks}) (42)
h_2''(\upsilon_k) = -\frac{1}{2} \gamma_k (r_{yk} - e^{\upsilon_k/2}) e^{\upsilon_k/2} + \frac{1}{2} \gamma_k e^{\upsilon_k} + B_{ks}. (43)

The Newton’s method updates the log-amplitude υks as

\hat{\upsilon}_{ks} \leftarrow \hat{\upsilon}_{ks} - \eta \frac{h_2'(\hat{\upsilon}_{ks})}{|h_2''(\hat{\upsilon}_{ks})| + \tau} (44)

where η is the learning rate and τ is a regularizer that avoids divergence when h2″ is close to zero. This prevents the numerical instability caused by the exponential term in (41).

In the experiments, we use the noisy signal log-spectrum for initialization, υ̂ks = log(|Yk|2). We set η = 0.5 and τ = 3, and run ten Newton iterations.
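The damped iteration (42)–(44) with these settings can be sketched as follows; γk, Bks, μks, and |Yk| are replaced by illustrative scalar values.

```python
import numpy as np

# Sketch of the log-spectral Newton search, (41)-(44), with the paper's
# settings eta = 0.5, tau = 3, ten iterations, initialized at the noisy
# log-spectrum. gamma, B, mu, ry stand in for gamma_k, B_ks, mu_ks, |Y_k|.
def map_log_amplitude(ry, gamma, B, mu, eta=0.5, tau=3.0, iters=10):
    v = np.log(ry ** 2)                       # initialize at log|Y_k|^2
    for _ in range(iters):
        e = np.exp(v / 2)
        g = -gamma * (ry - e) * e + 1.0 + B * (v - mu)               # (42)
        h = -0.5 * gamma * (ry - e) * e + 0.5 * gamma * e ** 2 + B   # (43)
        v = v - eta * g / (abs(h) + tau)                             # (44)
    return v

v_hat = map_log_amplitude(ry=1.5, gamma=4.0, B=2.0, mu=-0.5)
r_hat = np.exp(v_hat / 2)      # amplitude recovered as in (46)
```

Working in υk keeps the amplitude e^{υk/2} positive by construction, which is the advantage over the frequency-domain search noted above.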

We use the same strategy as described in Section III-B.2 to compute p(s | Y) using (39). The signal estimator follows

\bar{\upsilon}_k = \sum_s \hat{\upsilon}_{ks} \, p(s | Y) (45)
\hat{X}_k = \exp(\bar{\upsilon}_k / 2) \, e^{i \angle Y_k}. (46)

The MAP estimator of phase from the noisy signal is used.

In contrast to (40), where the amplitude estimators are averaged, (45) averages the log-amplitude estimators; the magnitude is obtained by taking the exponential. Because the exponential function is convex, (46) yields a smaller magnitude estimate than (40) when e^{υ̂ks/2} = r̂ks. Furthermore, this log-spectral estimator is well matched to speech recognizers, which extract Mel-frequency cepstral coefficients (MFCCs).

IV. LEARNING GAIN AND NOISE WITH GAUSSIAN APPROXIMATION

One drawback of the system comes from the assumption that the statistical properties of the training set match those of the test set, which implies a lack of adaptability. The energy of the test signals may not be reliably estimated from a training set because of uncontrolled factors such as variations of the speech loudness or of the distance between the speaker and the microphone. This mismatch results in poor enhancement because the pretrained model may not capture the statistics of the samples under the test conditions. One strategy to compensate for these variations is to estimate the gain H instead of fixing it to 1 as in the previous sections. Two conditions will be considered: frequency-independent gain, which is a scalar, and frequency-dependent gain. Gain adaptation needs to be carried out efficiently. For the signal prior given in (3), it is difficult to estimate the gain because of the involvement of the log-normal distributions (see Section II-B). However, under the Gaussian approximation, the gain can be estimated using the EM algorithm.

Recall that the acoustic model is Yk = HkXk + Nk as given in (2). If p(Xk) has the form of a GMM and p(Nk) is Gaussian, the model becomes exactly a mixture of factor analyzers (MFA). The gain H can be estimated in the same way as the loading matrix of an MFA. For this purpose, we take the approach of Section III-A and approximate the log-normal pdf p(Xk | s) by the normal distribution q(Xk | s) = 𝒩(Xk | 0, 1/λks), where the signal variance λks is given in (11). In addition, we assume additive Gaussian noise as in (6). Treating Xk as a hidden variable, we derive an EM algorithm, consisting of an expectation step (E-step) and a maximization step (M-step), to estimate the gain Hk and the noise spectrum Γ = diag(γ1,…,γK).

A. EM Algorithm for Gain and Noise Spectrum Estimation

The data log-likelihood denoted by ℒ is

\mathcal{L} = \sum_t \log p(Y_t) = \sum_t \log \left( \sum_{s_t} \int p(Y_t, X_t, s_t) \, dX_t \right) \geq \sum_t \sum_{s_t} \int \tilde{q}(X_t, s_t) [\log p(Y_t, X_t, s_t) - \log \tilde{q}(X_t, s_t)] \, dX_t

where t is the frame index. The above inequality holds for any choice of the distribution q̃(Xt, st). When q̃(Xt, st) equals the posterior probability p(Xt, st | Yt), the inequality becomes an equality. The EM algorithm is a standard technique for maximizing the likelihood. It iterates between updating the auxiliary distribution q̃(Xt, st) (E-step) and optimizing the model parameters {H, Γ} (M-step), until some convergence criterion is satisfied.

The E-step computes the posterior distribution over Xt, q̃(Xt | st) = p(Xt | Yt, st) = Πk p(Xkt | Ykt, st), with the gain H fixed, where p(Xkt | Ykt, st) is computed as

p(X_{kt} | Y_{kt}, s_t) = \frac{p(Y_{kt} | X_{kt}) \, q(X_{kt} | s_t)}{p(Y_{kt} | s_t)}. (47)

Note that we use the approximate signal prior q(Xkt | st) given in (12). Thus, the computation is standard Bayesian inference in a Gaussian system, and one can show that p(Xkt | Ykt, st) = 𝒩(Xkt | X̃kst, Σks), whose mean X̃kst and precision Σks are given by

\Sigma_{ks} = H_k^2 \gamma_k + 1/\lambda_{ks} (48)
\tilde{X}_{kst} = \frac{\gamma_k H_k^* Y_{kt}}{\Sigma_{ks}}. (49)

Here, H* denotes the complex conjugate of H. We point out that the precisions are time-independent while the means are time dependent.

The posterior state probability q̃(st) = p(st | Yt) is computed as

\tilde{q}(s_t) = p(s_t | Y_t) = \frac{p(Y_t | s_t) \, p(s_t)}{p(Y_t)} \propto \prod_k \mathcal{N}\left(Y_{kt} \,\Big|\, 0, \frac{1}{H_k^2 \lambda_{ks} + 1/\gamma_k}\right) p(s_t). (50)

The M-step updates the gain H and the noise spectrum Γ = diag(γ1,…,γK) with q̃ fixed. We consider two conditions: frequency-dependent gain and frequency-independent gain.

Frequency Independent Gain

H is a scalar; its update rule is

H = \frac{\sum_{t,s_t,k} \tilde{q}(s_t) \gamma_k Y_{kt} \tilde{X}_{kst}^*}{\sum_{t,s_t,k} \tilde{q}(s_t) \gamma_k (\tilde{X}_{kst} \tilde{X}_{kst}^* + \Sigma_{ks}^{-1})}. (51)

Frequency Dependent Gain

H = {H1,…, HK} is a vector. The update rule is, for k = {1,…, K},

H_k = \frac{\sum_{t,s_t} \tilde{q}(s_t) Y_{kt} \tilde{X}_{kst}^*}{\sum_{t,s_t} \tilde{q}(s_t) (\tilde{X}_{kst} \tilde{X}_{kst}^* + \Sigma_{ks}^{-1})}. (52)

The update rule for the precision of noise γk is

1/γk = (1/T) Σt,st ∫ q̃(Xkt, st | Yt) |Ykt − Hk Xkt|² dXkt.   (53)

The goal of the EM algorithm is to estimate the gain and the noise spectrum; it is not necessary to compute the intermediate results X̃kst in every iteration. Substantial computation can therefore be saved by substituting (49) into the learning rules, which significantly improves the computational efficiency and saves memory. After some mathematical manipulation, the EM algorithm for the frequency-dependent gain is as follows.

  1. Initialize Hk and γk.

  2. Compute q̃(st) using (50).

  3. Update the precisions Σks using (48).

  4. Update the gain
    Hk ← [Σt,st q̃(st) Σks⁻¹ Hk γk |Ykt|²] / [Σt,st q̃(st) ((Σks⁻¹ γk)² Hk² |Ykt|² + Σks⁻¹)].   (54)
  5. Update the noise precision
    1/γk ← (1/T) Σt,st q̃(st) ((1 − Σks⁻¹ γk Hk²)² |Ykt|² + Σks⁻¹ Hk²).   (55)
  6. Iterate steps 2), 3), 4), and 5) until convergence.

For frequency-independent gain, the gain is updated as follows:

H ← [Σt,st,k q̃(st) Σks⁻¹ H γk² |Ykt|²] / [Σt,st,k q̃(st) γk ((Σks⁻¹ γk)² H² |Ykt|² + Σks⁻¹)].   (56)

The block diagram is shown in Fig. 3. In the above EM algorithm, Σks is time independent; thus, it is computed only once for all frames, and |Ykt|² can be computed in advance.
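The full iteration of steps 1)–6) for the frequency-dependent gain can be sketched as follows. This is an illustrative Python rendering under our own naming and two simplifying assumptions of ours: the gain is treated as real and positive, and Σks is recomputed with the updated gain before the noise update:

```python
import numpy as np

def em_gain_noise(Y, lam, prior, n_iter=100):
    """EM sketch for the frequency-dependent gain H_k and noise precision gamma_k.

    Y     : (K, T) complex noisy FFT coefficients
    lam   : (K, S) fixed signal variances lambda_ks from the speech GMM
    prior : (S,) state prior probabilities p(s)
    """
    K, T = Y.shape
    P = np.abs(Y) ** 2                       # |Y_kt|^2, computed in advance
    H = np.ones(K)                           # step 1: initialize the gain to 1
    gamma = 1.0 / (0.3 * P.mean(axis=1))     # noise variance ~30% of signal power
    for _ in range(n_iter):
        # Step 2, eq. (50): state posteriors from the marginal variance of Y
        # (log-domain, up to state-independent constants).
        var = (H ** 2)[:, None] * lam + (1.0 / gamma)[:, None]   # (K, S)
        loglik = -np.log(var).sum(axis=0)[:, None] - (1.0 / var).T @ P
        loglik += np.log(prior)[:, None]
        q = np.exp(loglik - loglik.max(axis=0))
        q /= q.sum(axis=0)                   # q(s_t), shape (S, T)
        # Step 3, eq. (48): time-independent posterior precisions.
        Sigma = (H ** 2 * gamma)[:, None] + 1.0 / lam            # (K, S)
        w = q.sum(axis=1)                    # sum_t q(s_t)
        Pq = P @ q.T                         # (K, S): sum_t q(s_t) |Y_kt|^2
        # Step 4, eq. (54): gain update.
        num = ((H * gamma)[:, None] / Sigma * Pq).sum(axis=1)
        den = ((gamma[:, None] / Sigma) ** 2 * (H ** 2)[:, None] * Pq
               + w[None, :] / Sigma).sum(axis=1)
        H = num / den
        # Step 5, eq. (55): noise precision update.
        Sigma = (H ** 2 * gamma)[:, None] + 1.0 / lam
        resid = ((1.0 - (gamma * H ** 2)[:, None] / Sigma) ** 2 * Pq
                 + w[None, :] * (H ** 2)[:, None] / Sigma).sum(axis=1)
        gamma = T / resid
    return H, gamma
```

Note how |Ykt|² enters the updates only through the sufficient statistics Pq and w, which is exactly the computational saving obtained by substituting (49) into the learning rules.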

Fig. 3. Block diagram of the EM algorithm for the gain and noise spectrum estimation. The E-step, computing p(X, s | Y, H), and the M-step, updating H and Γ, iterate until convergence.

In our experiment, because the test files are 1–2-s-long segments, the parameters cannot be reliably learned from a single segment. Thus, we concatenate four segments into one testing file. The gain is initialized to 1. The noise covariance is initialized to 30% of the signal covariance for all signal-to-noise ratio (SNR) conditions, which does not incorporate any prior SNR knowledge. Because the EM algorithm for estimating the gain and noise is efficient, we set strict convergence criteria: a minimum of 100 EM iterations, a likelihood change of less than 1, and a gain change of less than 10⁻⁴ per iteration.

B. Identifiability of Model Parameters

The MFA is not identifiable because it is invariant under proper rescaling of its parameters. In our case, however, the parameters H and Γ are identifiable. First, the speech model, a GMM trained on clean speech signals, remains fixed while the parameters are learned; this fixed speech prior removes the scaling uncertainty of the gain H. Second, the speech is modeled by a GMM while the noise is modeled by a single Gaussian; the structure of speech, captured by the GMM through its higher order statistics, does not resemble a single Gaussian, which makes the noise spectrum Γ identifiable. As shown in our experiments, the gain H and noise spectrum Γ are reliably estimated using the EM algorithm.

V. EXPERIMENTS AND RESULTS

We evaluate the performance of the proposed algorithms by applying them to enhance speech corrupted by various levels of SSN. The SNR, spectral distortion (SD), and word recognition error rate serve as quantitative criteria for comparison with the benchmark algorithms.

A. Task and Dataset Description

For all the experiments in this paper, we use the materials provided by the speech separation challenge [24]. This data set contains six-word sentences from 34 speakers. The speech follows the sentence grammar 〈$command〉 〈$color〉 〈$preposition〉 〈$letter〉 〈$number〉 〈$adverb〉. There are 25 choices for the letter (a–z except w), ten choices for the number (0–9), four choices for the command (bin, lay, place, set), four choices for the color (blue, green, red, white), four choices for the preposition (at, by, in, with), and four choices for the adverb (again, now, please, soon). The time-domain signals are sampled at 25 kHz. Provided with the training samples, the task is to recover the speech signals and recognize the key words (color, letter, digit) in the presence of different levels of SSN. Fig. 4 shows the speech and SSN spectra averaged over a segment under 0-dB SNR. The average spectra of the speech and the noise have similar shapes; hence the name speech-shaped noise. The testing set includes noisy signals under four SNR conditions, −12 dB, −6 dB, 0 dB, and 6 dB, each consisting of 600 utterances from the 34 speakers.

Fig. 4. Plot of the SSN spectrum (dotted line) and the speech spectrum (solid line) averaged over one segment under 0-dB SNR. Note the similar shapes.

B. Training the Speech Model

The training set consists of clean signal segments that are 1–2 s long; they are used to train our prior speech model. To obtain a reliable speech model, we randomly concatenate 2 minutes of signals from the training set and analyze them using Hanning windows, each 800 samples long and overlapping by half a window. Frequency coefficients are obtained by applying a 1024-point FFT to the time-domain signals. Coefficients in the log-spectral domain are obtained by taking the logarithm of the magnitude of the FFT coefficients. Due to FFT/IFFT symmetry, only the first 513 frequency components are kept. Cepstral coefficients are obtained by applying the IFFT to the log-spectral amplitudes.
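The framing and feature extraction described above can be sketched as follows; the names are ours, and the small constant added before the logarithm is our own numerical safeguard:

```python
import numpy as np

def log_spectral_features(x, win_len=800, n_fft=1024):
    """Framing and feature extraction as described above (illustrative sketch).

    x : 1-D time-domain signal (sampled at 25 kHz in the paper's setup).
    Returns the log-spectral amplitudes (first 513 bins) and the cepstra.
    """
    hop = win_len // 2                                 # half-window overlap
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * win
                       for i in range(n_frames)])
    X = np.fft.fft(frames, n=n_fft, axis=1)            # 1024-point FFT
    logspec = np.log(np.abs(X) + 1e-12)                # log-spectral amplitude
    cep = np.real(np.fft.ifft(logspec, axis=1))        # cepstra via IFFT
    return logspec[:, :n_fft // 2 + 1], cep            # keep first 513 bins
```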

The speech model for each speaker is a GMM with 30 states in the log-spectral domain. First, we take the first 40 cepstral coefficients and apply the k-means algorithm to obtain k = 30 clusters. Next, the outputs of the k-means clustering are used to initialize a GMM on those 40 cepstral coefficients. Then, we convert the GMM from the cepstral domain into the log-spectral domain using the FFT. Finally, the EM algorithm, initialized by the converted GMM, is used to train the GMM in the log-spectral domain. After training, this 30-state log-spectral domain GMM for speech is kept fixed when processing the noisy signals.

C. Benchmark Algorithms for Comparison

In this section, we present the benchmark algorithms against which we compare the proposed algorithms: the Wiener filter, the perceptual model [10], the linear approximation [17], [18], and the model based on a super Gaussian prior [25]. We assume that the parameters of the noise model are available; in the experiment, they are estimated by concatenating 50 segments.

1) Wiener Filter (Wiener)

The time-varying Wiener filter assumes that both the signal and the noise power are known and stationary over a short period of time. In the experiment, we first divide the signals into frames 800 samples long with half overlap. Both speech and noise are assumed stationary within each frame. To estimate the speech and noise power, 200-sample-long subframes with half overlap are chosen within each frame. Hanning windows are applied to the subframes, and a 256-point FFT is performed on them to obtain the frequency coefficients. The signal power within frame t for frequency bin k, denoted by Ptkx, is computed by averaging the power of the FFT coefficients over all the subframes that belong to frame t. The same method is used to compute the noise power, denoted by Ptkn. The signal estimate is computed as

X̂tjk = [Ptkx / (Ptkx + Ptkn)] Ytjk   (57)

where j is the subframe index and k denotes the frequency bins. After IFFT, in the time domain, each frame can be synthesized by overlap-adding the subframes, and the estimated speech signal is obtained by overlap-adding the frames.
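A minimal sketch of the per-frame power estimate and the subframe gain (57), with argument names of our own choosing:

```python
import numpy as np

def frame_power(subframe_ffts):
    """Average |FFT|^2 over the subframes belonging to one frame."""
    return np.mean(np.abs(subframe_ffts) ** 2, axis=0)

def wiener_gain(Px, Pn, Y):
    """Eq. (57): apply the time-varying Wiener gain to one subframe's
    FFT coefficients, given per-bin signal and noise power estimates."""
    return Px / (Px + Pn) * Y
```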

Because the signal and noise powers are derived locally for each frame from the true speech and noise, the Wiener filter contains a strong and detailed signal prior. Its performance can therefore be regarded as a sort of experimental upper bound for the proposed methods.

2) Perceptual Model (Wolfe)

The perceptually motivated noise reduction technique can be seen as a masking process: the original signal is estimated by applying suppression rules. For comparison, we use the method described in [10], which estimates the spectral amplitude by minimizing the following cost function:

C(âk, ak) = { (âk − ak − mk/2)² − (mk/2)²,  if |âk − ak − mk/2| > mk/2
           { 0,  otherwise   (58)

where âk is the estimated spectral amplitude, ak is the true spectral amplitude, and mk is the masking threshold. This cost function penalizes positive and negative errors differently, because positive estimation errors are perceived as additive noise while negative errors are perceived as signal attenuation [10]. Because speech is stochastic, the true spectral amplitude is unavailable; therefore, âk is computed by minimizing the expected cost

âk = arg min_{âk} ∫∫ C(âk, ak) p(αk, ak | Yk) dαk dak   (59)

where αk is the phase and p(αk, ak | Yk) is the posterior signal distribution. Details of the algorithm can be found in [10]; the MATLAB code is available online [26]. The original code adds synthetic white noise to the clean signal; we modified it to corrupt the speech with SSN at different SNR levels.

We chose this method because we hypothesize that such a spectral-analysis-based approach fails to enhance SSN-corrupted speech, owing to the spectral similarity between the speech and the noise shown in Fig. 4. Being motivated by human perception, a different perspective, it also serves as a benchmark against which we can compare our methods.

3) Linear Approximation (Linear)

It can be shown that the relationship among the log-spectra of the signal xk, the noisy signal yk, and the noise nk is given by [17], [18]

yk = xk + log(1 + exp(nk − xk)) + ϵk   (60)

where ϵk is an error term.

The speech model remains the GMM given by (3). The noise log-spectrum n has a Gaussian density with mean ρ and precision D, while the error term ϵ is zero-mean Gaussian with precision R:

p(n) = 𝒩(n | ρ, D) = Πk 𝒩(nk | ρk, Dk)   (61)
p(ϵ) = 𝒩(ϵ | 0, R) = Πk 𝒩(ϵk | 0, Rk).   (62)

This essentially assumes a log-normal pdf for the noise FFT coefficients, in contrast to the noise model in (6).

A linear approximation to (60) has been proposed in [17] and [18] to improve tractability. Note that there are two hidden variables, xk and nk, due to the error term ϵk. Let zk = (xk, nk)^T. Define g(zk) = xk + log(1 + exp(nk − xk)) and its partial derivatives gx(zk) = ∂g/∂xk = 1/(1 + exp(nk − xk)) and gn(zk) = ∂g/∂nk = 1/(1 + exp(xk − nk)), collected as g′(zk) = (gx(zk), gn(zk))^T. Using (60) and expanding g(zk) linearly around z̃ks = (x̃ks, ñks)^T, yk becomes a linear function of zk

yk ≈ l(zk) + ϵk   (63)

where

l(zk) = g(z̃ks) + g′(z̃ks)^T (zk − z̃ks).   (64)

The choice of z̃ks will be discussed later. We now have a linear Gaussian system, and the posterior distribution over zk is Gaussian, 𝒩(zk | ẑks, Λ). The mean ẑks and the precision Λ satisfy

Λ (ẑks − zk) = Rk (yk − l(zk)) g′(z̃ks) + Gks (ζks − zk)   (65)
Λ = g′(z̃ks) Rk g′(z̃ks)^T + Gks   (66)

where ζks = (μks, ρk)^T collects the means of the speech GMM component and the noise log-spectrum, and Gks = diag(Bks, Dk) the corresponding precisions.

The accuracy of the linear approximation depends strongly on the expansion point z̃ks for g(zk). A reasonable choice is the MAP point. Substituting zk = z̃ks into (65) and setting the new z̃ks to ẑks, we obtain an iterative update for z̃ks

z̃ks ← z̃ks + η Λ⁻¹ {Rk (yk − g(z̃ks)) g′(z̃ks) + Gks (ζks − z̃ks)}.   (67)

Here, η is a learning rate introduced to avoid oscillation. This iterative update yields the signal log-spectral estimator x̃ks, which is the first element of z̃ks.

The state probability p(s | y) is computed, per Bayes’ rule, as p(s | y) ∝ p(y | s) p(s). The state-conditional likelihood is

p(y | s) = Πk √(Γks/2π) exp( −(Γks/2) (yk − l(ζks))² )   (68)

where the mean lks) is given in (64) and the precision Γks = (1/gT G−1 g′ + 1/Rk)).

The log-spectral estimator is x̂k = Σs x̃ks p(s | y). Using the phase of the noisy signal, ∠Yk, the frequency-domain signal estimate is given by X̂k = exp(x̂k/2) e^{i∠Yk}.

We observed that Newton’s method with learning rate 1 oscillates; therefore, we set η = 0.5 in our experiments. We initialize the iteration of (67) with two starting points, (yk, ρk)^T and (μks, ρk)^T, and choose the one that yields the higher likelihood. Seven iterations are sufficient for convergence. Note that optimizing over the two variables x and n increases the computational cost.
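One per-bin pass of the iterative MAP update (65)–(67) might look as follows; this is our own scalar-per-bin sketch with hypothetical names, not the implementation of [17], [18]:

```python
import numpy as np

def linear_approx_map(y, mu, B, rho, D, R, eta=0.5, n_iter=7):
    """Iterative MAP update (67) for z = (x, n) in one frequency bin (sketch).

    y      : noisy log-spectrum value y_k
    mu, B  : prior mean and precision of the speech log-spectrum (current state)
    rho, D : prior mean and precision of the noise log-spectrum
    R      : precision of the error term epsilon_k
    """
    zeta = np.array([mu, rho])                 # prior means zeta_ks
    G = np.diag([B, D])                        # prior precisions G_ks
    z = np.array([y, rho], dtype=float)        # one of the two suggested starts
    for _ in range(n_iter):
        x, n = z
        g = x + np.log1p(np.exp(n - x))        # g(z)
        gp = np.array([1.0 / (1.0 + np.exp(n - x)),    # g_x(z)
                       1.0 / (1.0 + np.exp(x - n))])   # g_n(z)
        Lam = R * np.outer(gp, gp) + G                 # eq. (66)
        grad = R * (y - g) * gp + G @ (zeta - z)
        z = z + eta * np.linalg.solve(Lam, grad)       # eq. (67)
    return z                                   # (x~_ks, n~_ks)
```

In a robust implementation, log1p/exp would be replaced by a log-sum-exp form to avoid overflow when n greatly exceeds x.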

4) Super Gaussian Prior (SuperGauss)

This method was developed in [25]. Let XR = Re{X} and XI = Im{X} denote the real and imaginary parts of the signal FFT coefficient. The super Gaussian priors for XR and XI are double-sided exponential (Laplacian) distributions, given by

p(XR) = (1/σx) e^{−2|XR|/σx}   (69)
p(XI) = (1/σx) e^{−2|XI|/σx}.   (70)

Assume a Gaussian density for the noise N, p(N) = 𝒩(N | 0, 1/σn²). Here, σx² and σn² are the means of |X|² and |N|², respectively. Let ξ = σx²/σn² be the a priori SNR, and let YR = Re{Y} be the real part of the noisy-signal FFT coefficient. Define LR+ = 1/√ξ + YR/σn and LR− = 1/√ξ − YR/σn. It was shown in [25, (11)] that the optimal estimator for the real part is

X̂R = YR + (σn/√ξ) [e^{2YR/σx} erfc(LR+) − e^{−2YR/σx} erfc(LR−)] / [e^{2YR/σx} erfc(LR+) + e^{−2YR/σx} erfc(LR−)]   (71)

where erfc(x) denotes the complementary error function. The optimal estimator X̂I for the imaginary part is derived analogously. The FFT coefficient estimate is given by X̂ = X̂R + iX̂I.
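Using the standard-library erfc, the real-part estimator (71) as reconstructed above can be sketched as follows; this is our illustrative implementation under the stated priors, not the code of [25]:

```python
import math

def supergauss_real(y_r, sigma_x, sigma_n):
    """MMSE estimate of X_R under the Laplacian prior (69), per (71).

    sigma_x, sigma_n : square roots of the mean powers of |X|^2 and |N|^2.
    """
    xi = sigma_x ** 2 / sigma_n ** 2                   # a priori SNR
    l_plus = 1.0 / math.sqrt(xi) + y_r / sigma_n       # L_R^+
    l_minus = 1.0 / math.sqrt(xi) - y_r / sigma_n      # L_R^-
    w_plus = math.exp(2.0 * y_r / sigma_x) * math.erfc(l_plus)
    w_minus = math.exp(-2.0 * y_r / sigma_x) * math.erfc(l_minus)
    return y_r + (sigma_n / math.sqrt(xi)) * (w_plus - w_minus) / (w_plus + w_minus)
```

For large |y_r| the exp·erfc products should be computed with a scaled complementary error function (e.g., erfcx) to avoid overflow; the plain form above is kept for clarity.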

D. Comparison Criteria

We employ three criteria to evaluate the performance of all the algorithms: SNR, SD, and word recognition error rate. For all experiments, the estimated signal x̂[t] is normalized to have the same variance as the clean signal x[t] before computing the quality measures.

1) Signal-to-Noise Ratio (SNR)

In the time domain, the SNR is defined by

SNR = 10 log₁₀ [ Σt (x[t])² / Σt (x̂[t] − x[t])² ]   (72)

where x[t] is the original clean signal and x̂[t] is the estimated signal.
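As a minimal sketch, (72) can be computed as:

```python
import numpy as np

def snr_db(x, x_hat):
    """Eq. (72): time-domain SNR (dB) of an estimate x_hat against the clean x."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x_hat - x) ** 2))
```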

2) Spectral Distortion (SD)

Let xkc and x̂kc be the cepstral coefficients of the clean signal and the estimated signal, respectively; their computation is described in Section II-A. The spectral distortion is defined in [25] by

SD = √( Σ_{k=1}^{16} (xkc − x̂kc)² )   (73)

where the first 16 cepstral coefficients are used.
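A sketch of (73), assuming the plain Euclidean distance over the first 16 cepstral coefficients:

```python
import numpy as np

def spectral_distortion(c, c_hat):
    """Eq. (73): distance between the first 16 cepstral coefficients of the
    clean and estimated signals."""
    d = np.asarray(c[:16], dtype=float) - np.asarray(c_hat[:16], dtype=float)
    return np.sqrt(np.sum(d ** 2))
```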

3) Word Recognition Error Rate

We use the speech recognition engine provided on the ICSLP website [24]. The recognizer is based on the HTK package. Its inputs are the MFCCs and their velocity (ΔMFCC) and acceleration (ΔΔMFCC), extracted from the speech waveforms. Words are modeled by HMMs with no skip states and two states per phoneme. The emission probability of each state is a GMM with 32 mixture components and diagonal covariance matrices. The grammar used by the recognizer is the sentence grammar shown in Section V-A. More details about the recognition engine can be found at [24].

For each input SNR condition, the estimated signals are fed into the recognizer. A score in {0, 1, 2, 3} is assigned to each utterance according to how many key words (color, letter, digit) are incorrectly recognized. The word recognition error rate, in percent, is the average score over all 600 testing utterances divided by 3.
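The scoring just described can be sketched as:

```python
def recognition_error_rate(scores):
    """Per-utterance scores in {0, 1, 2, 3} (miscounted key words) mapped to
    the word recognition error rate in percent."""
    return 100.0 * sum(scores) / (3 * len(scores))
```

For example, four utterances scored [0, 3, 3, 0] give a 50% error rate.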

E. Results

1) Performance Comparison With Fixed Gain and Known Noise Model

All the algorithms are applied to enhance speech corrupted by SSN at various SNR levels and are compared by SNR, SD, and word recognition error rate. The Wiener filter, which contains a strong and detailed signal prior derived from the clean speech, can be regarded as an experimental upper bound.

Figs. 5 and 6 show the spectrograms of a female and a male utterance, respectively. The SNR of the noisy speech is 6 dB. The Wiener filter recovers the spectrogram of the speech well. The methods based on the log-spectral domain models (Linear, LaplaceFFT, LaplaceLS, and Gaussian) effectively suppress the SSN and recover the spectrogram. Because SuperGauss estimates the real and imaginary parts separately, the spectral amplitude is not optimally estimated, which leads to a blurred spectrogram. The perceptual model (Wolfe99) fails to suppress the SSN because of its spectral similarity to speech.

Fig. 5. Spectrogram of a female speech “lay blue with e four again.” (a) Clean speech. (b) Noisy speech of 6-dB SNR. (c)–(i) Enhanced signals by (c) Wiener filter, (d) perceptual model (Wolfe), (e) linear approximation (Linear), (f) super Gaussian prior (SuperGauss), Laplace method in (g) frequency domain (LaplaceFFT) and in (h) log-spectral domain (LaplaceLS), and (i) Gaussian approximation (Gaussian).

Fig. 6. Spectrogram of a male speech “lay green at r nine soon.” (a) Clean speech. (b) Noisy speech of 6-dB SNR. (c)–(i) Enhanced signals by the same algorithms as in Fig. 5: (c) Wiener filter, (d) Wolfe, (e) Linear, (f) SuperGauss, (g) LaplaceFFT, (h) LaplaceLS, (i) Gaussian.

The SNRs of the speech enhanced by the various algorithms are shown in Fig. 7(a). The Wiener filter performs best. The Laplace methods (LaplaceFFT and LaplaceLS) are very effective, with LaplaceLS the better of the two; this agrees with the belief that the log-spectral amplitude estimator is more suitable for speech processing. The Gaussian approximation works comparably well to the Laplace methods, with the advantage of greater computational efficiency since no iteration is necessary. The linear approximation yields inferior SNR; this approach involves two hidden variables, which may increase the uncertainty of the signal estimation. SuperGauss works better than the perceptual model (Wolfe99), which fails to suppress the SSN.

Fig. 7. Signal-to-noise ratio, spectral distortion, and recognition error rate of speech enhanced by the algorithms. The speech is corrupted at four input SNR values. The gain and the noise spectrum are assumed to be known. Wiener: Wiener filter; Wolfe99: perceptual model; Linear: linear approximation; SuperGauss: super Gaussian prior; LaplaceFFT: Laplace method in frequency domain; LaplaceLS: Laplace method in log-spectral domain; Gaussian: Gaussian approximation; NoDenoising: noisy speech input. (a) Signal-to-noise ratio. (b) Spectral distortion. (c) Recognition error rate.

The SDs of the speech enhanced by the various algorithms are shown in Fig. 7(b). The methods that estimate the spectral amplitude (Linear, LaplaceFFT, LaplaceLS) perform close to the Wiener filter. Because SuperGauss estimates the real and imaginary parts of the FFT coefficients separately, it introduces distortion into the spectral amplitude and gives higher SD. The perceptual model is not effective at suppressing SSN.

The word recognition error rates of the speech enhanced by the various algorithms are shown in Fig. 7(c). The outstanding performance of the Wiener filter may be considered an upper bound. Linear and LaplaceLS give very low word recognition error rates in the high-SNR range, because they estimate the log-spectral amplitude, which matches the recognizer input (MFCC) well. LaplaceLS is better than Linear in the low-SNR range, because Linear involves two hidden variables to estimate. LaplaceFFT and Gaussian also improve the recognition remarkably. Because SuperGauss offers less accurate spectral amplitude estimation and higher SD, it gives a higher word recognition error rate. Wolfe99 is not able to suppress SSN, and its decrease in performance may be caused by the spectral distortion it introduces.

The computational costs of these algorithms are given in Table I. All algorithms are implemented in MATLAB, and the experiments run on a 2.66-GHz PC. The linear approximation and the Laplace methods involve iterative optimization and are thus more computationally expensive; their efficiency also depends on the number of initializations and iterations. The methods that do not involve iteration, the Wiener filter, Gaussian, and SuperGauss, are much faster.

TABLE I.

Computational Time (Seconds) for Processing 10 s of Noisy Speech Sampled at 25 kHz

Wiener: 1.15 s | Linear: 134 s | SuperGauss: 1.28 s
LaplaceFFT: 57.3 s | LaplaceLS: 40.5 s | Gaussian: 0.25 s

2) Performance Comparison With Estimated Gain and Noise Spectrum

The performance of the Gaussian approximation with the fixed gain is compared against that with the estimated gain and noise spectrum. The SNR, SD, and word recognition error rate of the enhanced speech are shown in Fig. 8(a)–(c), respectively. The performances are almost identical, which demonstrates that, under the Gaussian approximation, the learning of the gain and noise spectrum is very effective: estimating them degrades performance only very slightly compared with the fixed-gain, known-noise scenario. Furthermore, with clean signal input, the estimated signal still has 32.71-dB SNR for the scalar gain and 15.32-dB SNR for the vector gain, and the recognition error rate is close to that of the clean signal input. The slight degradation in the vector-gain case arises because more parameters must be estimated.

Fig. 8. Signal-to-noise ratio, spectral distortion, and recognition error rate of speech enhanced by the algorithms based on the Gaussian approximation. The speech is corrupted by SSN. KnownNoise: known gain and noise spectrum; ScalarGain: estimated frequency-independent gain and noise spectrum; VectorGain: estimated frequency-dependent gain and noise spectrum; NoDenoising: noisy speech input. (a) Signal-to-noise ratio. (b) Spectral distortion. (c) Recognition error rate.

VI. CONCLUSION

We have developed speech enhancement algorithms based on approximate Bayesian estimation. These approximations make the GMM in the log-spectral domain applicable to speech enhancement. The log-spectral domain Laplace method, which computes the MAP estimator for the log-spectral amplitude, is particularly successful: it offers higher SNR, a smaller recognition error rate, and lower SD. This confirms that the log-spectrum is well suited to speech processing. The estimated log-spectral amplitude matches the speech recognizer input closely and significantly improves its performance, which makes this approach valuable for the recognition of noisy speech. However, the Laplace method requires iterative optimization, which increases the computational cost. Compared to the Laplace method, the Gaussian approximation, with its closed-form signal estimate, is more efficient and performs comparably well. Its fast gain and noise spectrum adaptation makes this algorithm more flexible. In the experiments, the proposed algorithms demonstrate superior performance over the spectral domain models and reduce the noise effectively even when its spectral shape is similar to that of the speech.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for valuable comments which significantly improved the presentation.

Biographies


Jiucang Hao received the B.S. degree from the University of Science and Technology of China (USTC), Hefei, and the M.S. degree from University of California at San Diego (UCSD), both in physics. He is currently pursuing the Ph.D. degree at UCSD.

His research interests are in developing new machine learning algorithms and applying them to areas such as speech enhancement, source separation, biomedical data analysis, etc.


Hagai Attias received the Ph.D. degree in theoretical physics from Yale University, New Haven, CT.

He is the President of Golden Metallic, Inc., San Francisco, CA. He has (co)authored over 60 scientific publications on machine learning theory and its applications in speech and audio processing, machine vision, and biomedical imaging. He has 12 issued patents. He was a Research Scientist at Microsoft Research, Redmond, WA, working in the Machine Learning and Applied Statistics Group. Several of his inventions at Microsoft were incorporated into the speech recognition engine used by the Windows operating system. Prior to that, he was a Sloan Postdoctoral Fellow at University of California, San Francisco (UCSF). At UCSF, he did some of the pioneering work on machine learning algorithms for audio analysis and source separation.


Srikantan Nagarajan received the M.S. and Ph.D. degrees in biomedical engineering from Case Western Reserve University, Cleveland, OH.

He did a Postdoctoral Fellowship at the Keck Center for Integrative Neuroscience, University of California, San Francisco (UCSF). Currently, he is a Professor in the Department of Radiology and Biomedical Imaging at UCSF and a faculty member in the UCSF/UCB Joint Graduate Program in Bioengineering. His research interests, in the area of neural engineering and machine learning, are to better understand neural mechanisms of sensorimotor learning and speech motor control, through the development of algorithms for improved functional brain imaging and biomedical signal processing.


Te-Won Lee received the M.S. degree and the Ph.D. degree (summa cum laude) in electrical engineering from the University of Technology Berlin, Berlin, Germany, in 1995 and 1997, respectively.

He was Chief Executive Officer and co-Founder of SoftMax, Inc., a start-up company in San Diego, CA, developing software for mobile devices. In December 2007, SoftMax was acquired by Qualcomm, Inc., the world leader in wireless communications where he is now a Senior Director of Technology leading the development of advanced voice signal processing technologies. Prior to Qualcomm and SoftMax, Dr. Lee was a Research Professor at the Institute for Neural Computation, University of California, San Diego, and a collaborating Professor in the Biosystems Department, Korea Advanced Institute of Science and Technology (KAIST). He was a Max-Planck Institute fellow (1995–1997) and a Research Associate at the Salk Institute for Biological Studies (1997–1999).

Dr. Lee received the Erwin-Stephan prize for excellent studies from the University of Technology Berlin and the Carl-Ramhauser prize for excellent dissertations from the Daimler–Chrysler Corporation. In 2007, he received the SPIE Conference Pioneer Award for work on independent component analysis and unsupervised learning algorithms.


Terrence J. Sejnowski (SM’91–F’06) is the Francis Crick Professor at The Salk Institute for Biological Studies, La Jolla, CA, where he directs the Computational Neurobiology Laboratory, an Investigator with the Howard Hughes Medical Institute, and a Professor of Biology and Computer Science and Engineering at the University of California, San Diego, where he is Director of the Institute for Neural Computation. The long-range goal of Dr. Sejnowski’s laboratory is to understand the computational resources of brains and to build linking principles from brain to behavior using computational models. This goal is being pursued with a combination of theoretical and experimental approaches at several levels of investigation ranging from the biophysical level to the systems level. His laboratory has developed new methods for analyzing the sources for electrical and magnetic signals recorded from the scalp and hemodynamic signals from functional brain imaging by blind separation using independent components analysis (ICA). He has published over 300 scientific papers and 12 books, including The Computational Brain (MIT Press, 1994), with P. Churchland.

Dr. Sejnowski received the Wright Prize for Interdisciplinary Research in 1996, the Hebb Prize from the International Neural Network Society in 1999, and the IEEE Neural Network Pioneer Award in 2002. He was elected an AAAS Fellow in 2006 and to the Institute of Medicine of the National Academies in 2008.

Footnotes

1

The FFT is Xk = Σ_{n=0}^{K−1} x[n] e^{−2πikn/K}.

2

The IFFT is x[n] = (1/K) Σ_{k=0}^{K−1} Xk e^{2πikn/K}.

Contributor Information

Jiucang Hao, Institute for Neural Computation, University of California, San Diego, CA 92093-0523 USA..

Hagai Attias, Golden Metallic, Inc., San Francisco, CA 94147 USA..

Srikantan Nagarajan, Department of Radiology, University of California, San Francisco, CA 94143-0628 USA..

Te-Won Lee, Qualcomm, Inc., San Diego, CA 92121 USA..

Terrence J. Sejnowski, Howard Hughes Medical Institute at the Salk Institute, La Jolla, CA 92037 USA, and also with the Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093 USA..

REFERENCES

  • 1.Ephraim Y, Cohen I. The Electrical Engineering Handbook. Boca Raton, FL: CRC; 2006. Recent advancements in speech enhancement. [Google Scholar]
  • 2.Attias H, Platt JC, Acero A, Deng L. Speech denoising and dereverberation using probabilistic models; Proc. NIPS; 2000. pp. 758–764. [Google Scholar]
  • 3.Gannot S, Burshtein D, Weinstein E. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 2001 Aug;vol. 49(no. 8):1614–1626. [Google Scholar]
  • 4.Cohen I, Gannot S, Berdugo B. An integrated real-time beam-forming and postfiltering system for nonstationary noise environments. EURASIP J. Appl. Signal Process. 2003;vol. 11:1064–1073. [Google Scholar]
  • 5.Boll SF. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Process. 1979 Apr;vol. ASSP-27(no. 2):113–120. [Google Scholar]
  • 6.Ephraim Y, Trees HLV. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 1995 Jul;vol. 3(no. 4):251–266. [Google Scholar]
  • 7.Hopgood JR, Rayner PJ. Single channel nonstationary stochastic signal separation using linear time-varying filters. IEEE Trans. Signal Process. 2003 Jul;vol. 51(no. 7):1739–1752. [Google Scholar]
  • 8.Czyzewski A, Krolikowski R. Noise reduction in audio signals based on the perceptual coding approach; Proc. IEEE WASPAA; 1999. pp. 147–150. [Google Scholar]
  • 9.Lee J-H, Jung H-J, Lee T-W, Lee S-Y. Speech coding and noise reduction using ICA-based speech features; Proc. Workshop ICA; 2000. pp. 417–422. [Google Scholar]
  • 10.Wolfe P, Godsill S. Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement; Proc. ICASSP; 2000. pp. 821–824. [Google Scholar]
  • 11.Ephraim Y. Statistical-model-based speech enhancement systems. Proc. IEEE. 1992 Oct;vol. 80(no. 10):1526–1555. [Google Scholar]
  • 12.Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. 1984 Dec;vol. ASSP-32(no. 6):1109–1121. [Google Scholar]
  • 13.Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. 1985 Apr;vol. ASSP-33(no. 2):443–445. [Google Scholar]
  • 14.Ephraim Y. A Bayesian estimation approach for speech enhancement using hidden Markov models. IEEE Trans. Signal Process. 1992 Apr;vol. 40(no. 4):725–735. [Google Scholar]
  • 15.Ephraim Y. Gain-adapted hidden Markov models for recognition of clean and noisy speech. IEEE Trans. Signal Process. 1992 Jun;vol. 40(no. 6):1303–1316. [Google Scholar]
  • 16.Burshtein D, Gannot S. Speech enhancement using a mixture-maximum model. IEEE Trans. Speech Audio Process. 2002 Sep;vol. 10(no. 6):341–351. [Google Scholar]
  • 17.Frey B, Kristjansson T, Deng L, Acero A. Learning dynamic noise models from noisy speech for robust speech recognition; Proc. NIPS; 2001. pp. 1165–1171. [Google Scholar]
  • 18.Kristjansson T, Hershey J. High resolution signal reconstruction; Proc. IEEE Workshop ASRU; 2003. pp. 291–296. [Google Scholar]
  • 19.Bishop CM. Neural Networks for Pattern Recognition. New York: Oxford Univ. Press; 1995. [Google Scholar]
  • 20.Attias H. A variational Bayesian framework for graphical models; Proc. NIPS; 2000. pp. 209–215. [Google Scholar]
  • 21.Azevedo-Filho A, Shachter RD. Laplace’s method approximations for probabilistic inference in belief networks with continuous variables; Proc. UAI; 1994. pp. 28–36. [Google Scholar]
  • 22.Cohen I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 2003 Sep;vol. 11(no. 5):466–475. [Google Scholar]
  • 23.Cover TM, Thomas JA. Elements of Information Theory. New York: Wiley-Interscience; 1991. [Google Scholar]
  • 24.Cooke M, Lee T-W. Speech separation challenge. [Online]. Available: http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.html.
  • 25.Martin R. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Trans. Speech Audio Process. 2005 Sep;vol. 13(no. 5):845–856. [Google Scholar]
  • 26.Wolfe P. Example of short-time spectral attenuation. [Online]. Available: http://www.eecs.harvard.edu/p̃atrick/research/stsa.html.
  • 27.Cohen I, Berdugo B. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett. 2002 Jan;vol. 9(no. 1):12–15. [Google Scholar]
  • 28.McAulay R, Malpass M. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust., Speech, Signal Process. 1980 Apr;vol. ASSP-28(no. 2):137–145. [Google Scholar]
  • 29.Martin R. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 2001 Jul;vol. 9(no. 5):504–512. [Google Scholar]
  • 30. Wang D, Lim J. The unimportance of phase in speech enhancement. IEEE Trans. Acoust., Speech, Signal Process. 1982 Apr;vol. ASSP-30(no. 4):679–681.
  • 31. Attias H, Deng L, Acero A, Platt J. A new method for speech denoising and robust speech recognition using probabilistic models for clean speech and for noise; Proc. Eurospeech; 2001. pp. 1903–1906.
  • 32. Brandstein MS. On the use of explicit speech modeling in microphone array applications; Proc. ICASSP; 1998. pp. 3613–3616.
  • 33. Hong L, Rosca J, Balan R. Independent component analysis based single channel speech enhancement; Proc. ISSPIT; 2003. pp. 522–525.
  • 34. Beaugeant C, Scalart P. Speech enhancement using a minimum least-squares amplitude estimator; Proc. IWAENC; 2001. pp. 191–194.
  • 35. Lotter T, Vary P. Noise reduction by maximum a posteriori spectral amplitude estimation with supergaussian speech modeling; Proc. IWAENC; 2003. pp. 83–86.
  • 36. Breithaupt C, Martin R. MMSE estimation of magnitude-squared DFT coefficients with supergaussian priors; Proc. ICASSP; 2003. pp. 848–851.
  • 37. Benesty J, Chen J, Huang Y, Doclo S. Study of the Wiener filter for noise reduction. In: Benesty J, Makino S, Chen J, editors. Speech Enhancement. New York: Springer; 2005.