Speech Enhancement Using Gaussian Scale Mixture Models

Jiucang Hao; Te-Won Lee; Terrence J Sejnowski

doi:10.1109/TASL.2009.2030012

Author manuscript; available in PMC: 2011 Feb 25.

Published in final edited form as: IEEE Trans Audio Speech Lang Process. 2010 Aug 11;18(6):1127–1136. doi: 10.1109/TASL.2009.2030012

PMCID: PMC3045111 NIHMSID: NIHMS270647 PMID: 21359139

Abstract

This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose variance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided a higher signal-to-noise ratio (SNR), and those reconstructed from the estimated log-spectra produced a lower word recognition error rate because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress.

Index Terms: Gaussian scale mixture model (GSMM), Laplace method, speech enhancement, variational approximation

I. Introduction

Speech enhancement improves the quality of signals corrupted by adverse noise and channel distortion, such as competing speakers, background noise, car noise, room reverberation, and low-quality microphones. Its broad range of applications includes mobile communications, robust speech recognition, low-quality audio devices, and aids for the hearing impaired.

Although speech enhancement has attracted intensive research [1] and algorithms with different motivations have been developed, it is still an open problem [2] because there are no precise models for either speech or noise [1]. Algorithms based on multiple microphones [2]–[4] and on a single microphone have also been successful in achieving some measure of speech enhancement [5]–[13].

In spectral subtraction [5], the noise spectrum is subtracted to estimate the spectral magnitude which is believed to be more important than phase for speech quality. Signal subspace methods [6] attempt to find a projection that maps the signal and noise onto disjoint subspaces. The ideal projection splits the signal and noise, and the enhanced signal is constructed from the components that lie in the signal subspace. This approach has been applied to single microphone source separation [14]. Other speech enhancement algorithms have been based on audio coding [15], independent component analysis (ICA) [16] and perceptual models [17].

Statistical-model-based speech enhancement systems [7] have proven to be successful. Both the speech and noise are assumed to obey random processes and are treated as random variables. The random processes are specified by their probability density functions (pdfs), and the dependency among the random variables is described by conditional probabilities. Because the exact models for speech and noise are unknown [1], speech enhancement algorithms based on various models have been developed. The short-time spectral amplitude (STSA) estimator [8] and the log-spectral amplitude estimator (LSAE) [9] use a Gaussian pdf for both speech and noise in the frequency domain, but differ in signal estimation. The STSA estimator is the minimum mean square error (MMSE) estimator of the spectral amplitude, while the LSAE is the MMSE estimator of the log-spectrum, which is believed to be more suitable for speech processing. Hidden Markov models (HMMs) that include the temporal structure have been developed for clean speech. An HMM with gain adaptation has been applied to speech enhancement [18] and to the recognition of clean and noisy speech [19]. Gaussian and super-Gaussian priors, including Laplacian and Gamma densities, have been used to model the real and imaginary parts of the frequency components [10], with the MMSE estimator used for signal estimation. The log-spectra of speech have often been modeled explicitly and accurately by the Gaussian mixture model (GMM) [11]–[13]. The GMM clusters similar log-spectra together and represents them by a mixture component. The GMM family can model any distribution given a sufficient number of mixtures [20], although a small number of mixtures is often enough. However, because signal estimation is intractable, approximations such as MIXMAX [11] and Taylor expansion [12], [13] are used. Speech enhancement using log-spectral domain models offers better spectral estimation and is more suitable for speech recognition.

Previous models have estimated either the frequency coefficients or the log-spectra, but not both. The estimated frequency coefficients usually produced better signal quality as measured by the signal-to-noise ratio (SNR), whereas the estimated log-spectra usually provided a lower recognition error rate; a higher SNR does not necessarily give a lower error rate. In this paper, we propose a novel approach that estimates both features at the same time. The idea is to specify the relation between the log-spectra and frequency coefficients stochastically. We model the log-spectra using a GMM following [11]–[13], where each mixture captures the spectra of similar phonemes. The frequency coefficients obey a Gaussian density whose variances are the exponentials of the log-spectra. This results in a Gaussian scale mixture model (GSMM) [21], which has been applied to time-frequency surface estimation [22], separation of sparse sources [23], and musical audio coding [24]. In a probabilistic setting, both features can be estimated. An approximate EM algorithm was developed to train the model, and two approaches, the Laplace method [25] and the variational approximation [26], were used for signal estimation. The enhanced signals can be constructed from either the estimated frequency coefficients or the estimated log-spectra, depending on the application.

This paper is organized as follows. Section II introduces the GSMM for the speech and the Gaussian model for the noise. In Section III, an EM algorithm for parameter estimation is derived. Section IV presents the Laplace method and a variational approximation for signal estimation. Section V shows the experimental results and comparisons to other algorithms, applied to enhance speech corrupted by speech-shaped noise (SSN) and Gaussian noise. Section VI concludes the paper.

Notation

We use x[t], y[t], and n[t] to denote the time domain signals for clean speech, noisy speech, and noise, respectively. The upper-case X_kt, Y_kt, and N_kt denote the frequency coefficients for frequency bin k at frame t. The ξ_kt is the log-spectrum. The N(ξ_k|μ_ks, ν_ks) is a Gaussian density for ξ_k with mean μ_ks and precision ν_ks, which is defined as the inverse of the variance, 1/ν_ks = E{|ξ_k − μ_ks|²|s}, where s is the mixture index.

II. Gaussian Scale Mixture Model

A. Acoustic Model

Assuming additive noise, the time domain acoustic model is y[t] = x[t] + n[t]. After the fast Fourier transform (FFT), it becomes

Y_{k} = X_{k} + N_{k}

(1)

where k denotes the frequency bin.

The noise is modeled by a Gaussian

p (Y_{k} ∣ X_{k}) = N (Y_{k} ∣ X_{k}, γ_{k}) = \frac{γ_{k}}{π} e^{- γ_{k} {∣ Y_{k} - X_{k} ∣}^{2}}

(2)

where the noise N_k = Y_k − X_k has zero mean and precision γ_k, with 1/γ_k = E{|Y_k − X_k|²}. Note that this is a Gaussian of a complex variable, because the FFT coefficients are complex.

B. Improperness of the Log-Normal Distribution for X_k

If the log-spectra x_k = log (|X_k|²) are modeled by a GMM, for each mixture s,

p (x_{k} ∣ s) = \sqrt{\frac{ν_{k s}}{2 π}} e^{- (ν_{k s} / 2) {(x_{k} - μ_{k s})}^{2}}

(3)

is a Gaussian with mean μ_ks and precision ν_ks. Express $X_{k} = X_{k}^{'} + {i X}_{k}^{″}$ by its real and imaginary parts. Then $X_{k}^{'} = e^{x_{k} / 2} cos θ_{k}$ and $X_{k}^{″} = e^{x_{k} / 2} sin θ_{k}$ , where θ_k is the phase. If the phase is uniformly distributed, p(θ_k) = (1/2π), the pdf for X_k is $p (X_{k} ∣ s) = p (X_{k}^{'}, X_{k}^{″} ∣ s) = (1 / J_{k}) p (x_{k} ∣ s) p (θ_{k})$ , where J_k is the Jacobian $J_{k} = (\partial (X_{k}^{'}, X_{k}^{″}) / \partial (x_{k}, θ_{k})) = {∣ X_{k} ∣}^{2} / 2$ . We have

p (X_{k} ∣ s) = \frac{1}{π {∣ X_{k} ∣}^{2}} \sqrt{\frac{ν_{k s}}{2 π}} e^{- (ν_{k s} / 2) {(log ({∣ X_{k} ∣}^{2}) - μ_{k s})}^{2}}

(4)

as plotted in Fig. 1. This is a log-normal pdf because log (|X_k|²) is normally distributed. Note that it has a saddle shape around zero. In contrast, for real speech, the pdf of the FFT coefficients is super-Gaussian and has a peak at zero.

C. Gaussian Scale Mixture Model for Speech Prior

Instead of assuming x_k = log (|X_k|²), we model this relation stochastically. To avoid confusion, we denote the random variable for the log-spectra as ξ_k. The conditional probability is

p (X_{k} ∣ ξ_{k}) = \frac{e^{- ξ_{k}}}{π} e^{- e^{- ξ_{k}} {∣ X_{k} ∣}^{2}} .

(5)

This is a Gaussian pdf with mean zero and precision e^−ξ_k. Note that ξ_k controls the scaling of X_k. Consider log p(X_k|ξ_k) = −ξ_k − e^−ξ_k|X_k|² − log π; its maximum is given by

{\hat{ξ}}_{k} = arg max_{ξ_{k}} p (X_{k} ∣ ξ_{k}) = log {∣ X_{k} ∣}^{2} .

(6)

Thus, we term ξ_k the log-spectrum.
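As a quick numerical check of (5) and (6), the following sketch evaluates log p(X_k|ξ_k) on a grid of ξ_k values and locates its maximizer, which should sit at log|X_k|². The function name `log_p_x_given_xi`, the grid, and the test coefficient are illustrative assumptions and are not part of the original method.

```python
import math

def log_p_x_given_xi(X, xi):
    """Log of the zero-mean complex Gaussian in (5), whose precision is exp(-xi)."""
    return -xi - math.exp(-xi) * abs(X) ** 2 - math.log(math.pi)

# Hypothetical FFT coefficient, chosen for illustration only.
X = 1.5 - 0.5j

# Search xi over [-5, 5] in steps of 0.001 and keep the maximizer.
grid = [-5.0 + 0.001 * i for i in range(10001)]
xi_hat = max(grid, key=lambda xi: log_p_x_given_xi(X, xi))
```

Because the log-density is concave in ξ_k, the grid maximizer lies within one grid step of log|X_k|².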

The phonemes of speech have particular spectra across frequency. To group phonemes of similar spectra together and represent them efficiently, we model the log-spectra by a GMM

p (ξ_{k} ∣ s) = N (ξ_{k} ∣ μ_{k s}, ν_{k s}) = \sqrt{\frac{ν_{k s}}{2 π}} e^{- (ν_{k s} / 2) {(ξ_{k} - μ_{k s})}^{2}}

(7)

p (ξ_{1}, \dots, ξ_{K}) = \sum_{s} p (s) \prod_{k} p (ξ_{k} ∣ s)

(8)

where s is the mixture index. Each mixture represents a template of log-spectra, with a corresponding variability allowed for each template via the Gaussian mixture component variances. A mixture may correspond to particular phonemes with similar spectra. Though the precision for ξ is diagonal, p(ξ₁,…, ξ_K) does not factorize over k, i.e., the frequency bins are dependent. The pdf for X_k is

p (X_{1}, \dots, X_{K}) = \sum_{s} p (s) \prod_{k} \int {d ξ}_{k} p (X_{k} ∣ ξ_{k}) p (ξ_{k} ∣ s)

(9)

which is the GSMM because ξ_k controls the scaling of X_k and obeys a GMM [21]. Note that {X₁, …, X_K} are statistically dependent because of the dependency among {ξ₁, …, ξ_K}.

The GSMM has a peak at zero and is super-Gaussian [21]. It is more peaky and has heavier tails than Gaussian, as shown in Fig. 1. The GSMM, which is unimodal and super Gaussian, is a proper model for speech and has been used in audio processing [22]–[24].
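The contrast between the log-normal density (4) and a single GSMM component near the origin can be checked numerically. The sketch below evaluates (4) directly and approximates the integral over ξ_k inside (9) with the trapezoidal rule; the function names, the integration grid, and the parameter values μ = 0, ν = 1 are assumptions made for illustration.

```python
import math

def lognormal_pdf(X, mu, nu):
    """Density (4): log|X|^2 is Gaussian with mean mu and precision nu (X must be nonzero)."""
    a2 = abs(X) ** 2
    return (math.sqrt(nu / (2 * math.pi)) / (math.pi * a2)
            * math.exp(-0.5 * nu * (math.log(a2) - mu) ** 2))

def gsmm_component_pdf(X, mu, nu, n=4001, width=8.0):
    """Integral of p(X|xi) p(xi|s) over xi, approximated by the trapezoidal rule."""
    sigma = 1.0 / math.sqrt(nu)
    lo = mu - width * sigma
    step = 2 * width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        xi = lo + i * step
        scale = math.exp(-xi)                                   # precision of X given xi
        lik = scale / math.pi * math.exp(-scale * abs(X) ** 2)  # (5)
        prior = math.sqrt(nu / (2 * math.pi)) * math.exp(-0.5 * nu * (xi - mu) ** 2)  # (7)
        total += (0.5 if i in (0, n - 1) else 1.0) * lik * prior
    return total * step

mu, nu = 0.0, 1.0                         # illustrative parameters
p_lognormal_near_zero = lognormal_pdf(1e-3 + 0j, mu, nu)
p_gsmm_at_zero = gsmm_component_pdf(0j, mu, nu)
```

At X_k = 0 the GSMM component has the closed form E{e^−ξ_k}/π = e^{−μ+1/(2ν)}/π, which the numerical integral should reproduce, while the log-normal density collapses toward zero as |X_k| → 0.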

III. EM Algorithm for Training the GSMM

The parameters of the GSMM, θ = {μ_ks, ν_ks, p(s)}, are estimated from the training samples by maximum likelihood (ML) using the EM algorithm [27]. The log-likelihood is

\begin{array}{l} L (θ) = \sum_{t} log p (X_{1 t}, \dots, X_{K t}) \\ = \sum_{t} log (\sum_{s_{t}} p (s_{t}) \prod_{k} \int p (X_{k t} ∣ ξ_{k t}) p (ξ_{k t} ∣ s_{t}) {d ξ}_{k t}) \\ \geq \sum_{t, s_{t}} \int q (s_{t}) \prod_{k} q (ξ_{k t} ∣ s_{t}) \times log \frac{p (s_{t}) \prod_{k} p (X_{k t} ∣ ξ_{k t}) p (ξ_{k t} ∣ s_{t})}{q (s_{t}) \prod_{k} q (ξ_{k t} ∣ s_{t})} {d ξ}_{1 t} \dots {d ξ}_{K t} \\ = F (q, θ) . \end{array}

(10)

The inequality holds for any choice of distribution q due to Jensen’s inequality [28]. The EM algorithm iteratively optimizes F(q, θ) over q and θ. When q equals the posterior distribution, q(ξ_1t, …, ξ_Kt, s_t) = p(ξ_1t, …, ξ_Kt, s_t|X_1t, …, X_Kt), the lower bound is tight, F(q, θ) = L(θ). The details of the EM algorithm are given in the Appendix.

IV. Two Signal Estimation Approaches

To recover the signal, we need the posterior pdf of the speech. However, for sophisticated models, the closed-form solutions for the posterior pdf are difficult to obtain. To enhance the tractability, we use the Laplace method [25] and a variational approximation [26].

Each frame is independent and processed sequentially. The frame index t is omitted for simplicity. We rewrite the full model as

\prod_{k} p (Y_{k} ∣ X_{k}) p (X_{k} ∣ ξ_{k}) p (ξ_{k} ∣ s) p (s)

(11)

where p(Y_k|X_k) is given by (2), p(X_k|ξ_k) is given by (5), p(ξ_k|s) is the Gaussian mixture component given by (7), and p(s) is the mixture probability.

A. Laplace Method for Signal Estimation

The Laplace method [25] computes the maximum a posteriori (MAP) estimator for each s. We estimate X_k and ξ_k by maximizing

\begin{array}{l} log p (X_{k}, ξ_{k} ∣ Y_{k}, s) = log p (Y_{k} ∣ X_{k}) + log p (X_{k} ∣ ξ_{k}) + log p (ξ_{k} ∣ s) + c \\ = - γ_{k} {∣ Y_{k} - X_{k} ∣}^{2} - ξ_{k} - e^{- ξ_{k}} {∣ X_{k} ∣}^{2} - \frac{ν_{k s}}{2} {(ξ_{k} - μ_{k s})}^{2} + c \\ = h_{s} (X_{k}, ξ_{k}) . \end{array}

(12)

For fixed ξ_k, the MAP estimator for X_k is

X_{k} = \frac{γ_{k} Y_{k}}{γ_{k} + e^{- ξ_{k}}} .

(13)

For fixed X_k, the optimization over ξ_k can be performed using Newton’s method.

ξ_{k s} \leftarrow ξ_{k s} - \frac{\frac{\partial h_{s} (X_{k}, ξ_{k})}{\partial ξ_{k}} ∣_{ξ_{k} = ξ_{k s}}}{\frac{\partial^{2} h_{s} (X_{k}, ξ_{k})}{\partial ξ_{k}^{2}} ∣_{ξ_{k} = ξ_{k s}}}

(14)

where (∂h_s(X_k, ξ_k)/∂ξ_k) = −1 + e^−ξ_k|X_k|² − ν_ks(ξ_k − μ_ks) and $(\partial^{2} h_{s} (X_{k}, ξ_{k}) / \partial ξ_{k}^{2}) = - e^{- ξ_{k}} {∣ X_{k} ∣}^{2} - ν_{k s}$ . This update rule is initialized from both ξ_ks = μ_ks, the means of the GSMM, and ξ_ks = log|Y_k|², the noisy log-spectra. After iterating to convergence, the ξ_ks that gives the higher value of h_s(X_k, ξ_k) is selected. Note that because h_s(X_k, ξ_k) is a concave function in ξ_k, $(\partial^{2} h_{s} (X_{k}, ξ_{k}) / \partial ξ_{k}^{2}) < 0$ , Newton’s method works efficiently.

Denote the convergent value for ξ_ks from (14) as ξ̄_ks and compute X̄_ks using (13). We obtain the MAP estimators

({\bar{X}}_{k s}, {\bar{ξ}}_{k s}) = arg max_{X_{k}, ξ_{k}} log p (X_{k}, ξ_{k} ∣ Y_{k}, s) .

(15)
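The alternation between (13) and the Newton update (14) for a single frequency bin and a single mixture can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the iteration limits, the tolerance, and the nesting of the Newton loop inside an outer loop are all hypothetical choices, and Y_k is assumed to be nonzero so that log|Y_k|² is finite.

```python
import math

def h_s(X, xi, Y, gamma, mu, nu):
    """Objective (12) without the additive constant c."""
    return (-gamma * abs(Y - X) ** 2 - xi - math.exp(-xi) * abs(X) ** 2
            - 0.5 * nu * (xi - mu) ** 2)

def map_estimate(Y, gamma, mu, nu, xi0, outer=200, inner=50, tol=1e-12):
    """Alternate the closed-form update (13) with Newton steps (14), starting at xi0."""
    xi = xi0
    for _ in range(outer):
        X = gamma * Y / (gamma + math.exp(-xi))              # (13), xi fixed
        a = abs(X) ** 2
        xi_old = xi
        for _ in range(inner):                               # (14), X fixed
            grad = -1.0 + math.exp(-xi) * a - nu * (xi - mu)
            hess = -math.exp(-xi) * a - nu                   # always negative
            step = grad / hess
            xi -= step
            if abs(step) < tol:
                break
        if abs(xi - xi_old) < tol:
            break
    X = gamma * Y / (gamma + math.exp(-xi))
    return X, xi

def laplace_bin(Y, gamma, mu, nu):
    """Run from both initializations in the text and keep the larger h_s."""
    starts = [mu, math.log(abs(Y) ** 2)]                     # assumes Y != 0
    candidates = [map_estimate(Y, gamma, mu, nu, x0) for x0 in starts]
    return max(candidates, key=lambda c: h_s(c[0], c[1], Y, gamma, mu, nu))
```

At the returned point the gradient with respect to ξ_k should vanish, and (13) shrinks the noisy coefficient, so the estimate has smaller magnitude than Y_k.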

Because the true s is unknown, the estimators are averaged over all mixtures. The posterior mixture probability is

p (s ∣ Y_{1}, \dots, Y_{K}) \propto p (s) \prod_{k} \int p (Y_{k} ∣ X_{k}) p (X_{k} ∣ s) {d X}_{k}

(16)

where p(X_k|s) = ∫p(X_k|ξ_k)p(ξ_k|s)dξ_k. This integral is intractable. The p(X_k|s) has zero mean and variance β_ks = ∫|X_k|²p(X_k|s)dX_k = e^{μ_ks+1/(2ν_ks)}, and is approximated by a Gaussian with the same mean and variance, p(X_k|s) ≈ N(X_k|0, 1/β_ks). Under this approximation, we have

p (s ∣ Y_{1}, \dots, Y_{K}) \propto p (s) \prod_{k} N (Y_{k} ∣ 0, \frac{1}{\frac{1}{γ_{k}} + e^{μ_{k s} + 1 / (2 ν_{k s})}}) .

(17)

The estimated signal can be constructed from the average of either X̄_ks or ξ̄_ks, weighted by the posterior mixture probability

{\hat{X}}_{k} = \sum_{s} {\bar{X}}_{k s} p (s ∣ Y_{1}, \dots, Y_{K})

(18)

{\hat{ξ}}_{k} = \sum_{s} {\bar{ξ}}_{k s} p (s ∣ Y_{1}, \dots, Y_{K})

(19)

{\hat{X}}_{k}^{l s} = e^{{\hat{ξ}}_{k} / 2} e^{i ∠ Y_{k}}

(20)

where the phase of the noisy signal ∠Y_k is used. The time domain signal is synthesized by applying inverse fast Fourier transform (IFFT).
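The mixture weighting in (17) and the two reconstructions (18) and (20) can be illustrated for one frame as below. The sketch assumes the per-mixture MAP estimates are already available as nested lists indexed `[s][k]`; the function names and this data layout are hypothetical. The weights are normalized in the log domain, which is an implementation choice made here to avoid underflow when many bins are multiplied together.

```python
import cmath
import math

def log_complex_gauss_zero_mean(Y, precision):
    """Log density of a zero-mean complex Gaussian parameterized by its precision."""
    return math.log(precision / math.pi) - precision * abs(Y) ** 2

def mixture_posterior(Y, gamma, mu, nu, prior):
    """Approximate p(s|Y_1..Y_K) from (17); mu[s][k], nu[s][k], prior[s], gamma[k]."""
    log_w = []
    for s in range(len(prior)):
        total = math.log(prior[s])
        for k in range(len(Y)):
            beta = math.exp(mu[s][k] + 0.5 / nu[s][k])       # variance of p(X_k|s)
            total += log_complex_gauss_zero_mean(Y[k], 1.0 / (1.0 / gamma[k] + beta))
        log_w.append(total)
    m = max(log_w)                                           # subtract max before exp
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

def reconstruct(Y, X_bar, xi_bar, post):
    """Posterior-weighted averages (18), (19) and the log-spectral estimate (20)."""
    S, K = len(post), len(Y)
    X_hat = [sum(post[s] * X_bar[s][k] for s in range(S)) for k in range(K)]
    xi_hat = [sum(post[s] * xi_bar[s][k] for s in range(S)) for k in range(K)]
    # (20): amplitude from the estimated log-spectrum, phase from the noisy signal.
    X_ls = [math.exp(xi_hat[k] / 2) * cmath.exp(1j * cmath.phase(Y[k])) for k in range(K)]
    return X_hat, xi_hat, X_ls
```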

B. Variational Approximation for Signal Estimation

Variational approximation [26] employs a factorized posterior pdf. Here, we assume the posterior pdf over X_k and ξ_k conditioned on s factorizes

p (X_{k}, ξ_{k} ∣ Y_{k}, s) \approx q (X_{k} ∣ s) q (ξ_{k} ∣ s) .

(21)

The difference between q and the true posterior is measured by the Kullback–Leibler (KL)-divergence [28], D, defined as

D (q | | p) = - E^{q} {log \frac{p (s ∣ Y_{1}, \dots, Y_{K}) \prod_{k} p (X_{k}, ξ_{k} ∣ Y_{k}, s)}{q (s) \prod_{k} q (X_{k} ∣ s) q (ξ_{k} ∣ s)}}

(22)

where E^q is the expectation over q. Choose the optimal q that is closest to the true posterior in the sense of the KL -divergence, q = arg min_q D(q||p).

Following the derivation in [26], the optimal q(X_k|s) satisfies

log q (X_{k} ∣ s) \propto log p (Y_{k} ∣ X_{k}) + \int {d ξ}_{k} q (ξ_{k} ∣ s) log p (X_{k} ∣ ξ_{k}) \propto - γ_{k} {∣ Y_{k} - X_{k} ∣}^{2} - \int e^{- ξ_{k}} q (ξ_{k} ∣ s) {d ξ}_{k} {∣ X_{k} ∣}^{2} .

(23)

As shown later in (28), we can use q(ξ_k|s) = N(ξ_k|ξ̄_ks, ψ_ks). Because the above equation is quadratic in X_k, q(X_k|s) is Gaussian

q (X_{k} ∣ s) = N (X_{k} ∣ {\bar{X}}_{k s}, ϕ_{k s})

(24)

{\bar{X}}_{k s} = \frac{γ_{k}}{ϕ_{k s}} Y_{k}

(25)

ϕ_{k s} = γ_{k} + e^{- {\bar{ξ}}_{k s} + 1 / (2 ψ_{k s})} .

(26)

The optimal q(ξ_k|s) that minimizes D(q||p) is

log q (ξ_{k} ∣ s) \propto \int {d X}_{k} q (X_{k} ∣ s) log p (X_{k} ∣ ξ_{k}) + log p (ξ_{k} ∣ s) \propto - ξ_{k} - e^{- ξ_{k}} \int {∣ X_{k} ∣}^{2} q (X_{k} ∣ s) {d X}_{k} - \frac{ν_{k s}}{2} {(ξ_{k} - μ_{k s})}^{2} .

(27)

Because this pdf is hard to work with, we use the Laplace method to approximate it by a Gaussian

q (ξ_{k} ∣ s) = N (ξ_{k} ∣ {\bar{ξ}}_{k s}, ψ_{k s})

(28)

{\bar{ξ}}_{k s} = ρ_{k s} + \frac{1}{ψ_{k s}} (e^{- ρ_{k s}} ({∣ {\bar{X}}_{k s} ∣}^{2} + \frac{1}{ϕ_{k s}}) - ν_{k s} (ρ_{k s} - μ_{k s}) - 1)

(29)

ψ_{k s} = e^{- ρ_{k s}} ({∣ {\bar{X}}_{k s} ∣}^{2} + \frac{1}{ϕ_{k s}}) + ν_{k s} .

(30)

The ρ_ks is chosen to be the posterior mode, ρ_ks = ξ̄_ks, so the update rule is

{\bar{ξ}}_{k s} \leftarrow {\bar{ξ}}_{k s} + \frac{1}{ψ_{k s}} (e^{- {\bar{ξ}}_{k s}} ({∣ {\bar{X}}_{k s} ∣}^{2} + \frac{1}{ϕ_{k s}}) - ν_{k s} ({\bar{ξ}}_{k s} - μ_{k s}) - 1)

(31)

ψ_{k s} \leftarrow e^{- {\bar{ξ}}_{k s}} ({∣ {\bar{X}}_{k s} ∣}^{2} + \frac{1}{ϕ_{k s}}) + ν_{k s} .

(32)

The ψ_ks > 0 indicates log q(ξ_k|s) is a concave function in ξ_k, thus Newton’s method is efficient.

The variational algorithm is initialized with ξ_ks = log(|Y_k|²) and ϕ_ks = γ_k + exp(−ξ_ks). Note that X̄_ks in (25) can be substituted into (31) and (32) to avoid redundant computation. Then the updates over ψ_ks, ξ_ks and ϕ_ks iterate until convergence.
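The iteration over (25), (26), (31), and (32) for one bin and one mixture might be organized as in the sketch below. The function name, the iteration cap, the stopping tolerance, and the exact ordering of the updates within one sweep are assumptions; the text only states that the updates over ψ_ks, ξ_ks, and ϕ_ks iterate until convergence. Y_k is assumed to be nonzero.

```python
import math

def variational_bin(Y, gamma, mu, nu, max_iter=500, tol=1e-12):
    """Fixed-point iteration for q(X_k|s) and q(xi_k|s); returns X_bar, xi_bar, phi, psi."""
    xi = math.log(abs(Y) ** 2)            # initialization stated in the text
    phi = gamma + math.exp(-xi)
    psi = nu
    for _ in range(max_iter):
        X = gamma * Y / phi                                   # (25)
        b = abs(X) ** 2 + 1.0 / phi                           # E_q{|X_k|^2}
        psi = math.exp(-xi) * b + nu                          # (32), at the current xi
        step = (math.exp(-xi) * b - nu * (xi - mu) - 1.0) / psi
        xi += step                                            # (31), a Newton step
        phi = gamma + math.exp(-xi + 0.5 / psi)               # (26)
        if abs(step) < tol:
            break
    # psi was evaluated just before the last step; at convergence the difference is negligible.
    return gamma * Y / phi, xi, phi, psi
```

At convergence the Newton step in (31) vanishes, so the bracketed term in (31) should be numerically zero, and both precisions exceed their prior values (ϕ_ks > γ_k, ψ_ks > ν_ks).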

To compute the posterior mixture probability, we define

\begin{array}{l} g_{k s} = \int q (X_{k} ∣ s) q (ξ_{k} ∣ s) log \frac{p (Y_{k} ∣ X_{k}) p (X_{k} ∣ ξ_{k}) p (ξ_{k} ∣ s)}{q (X_{k} ∣ s) q (ξ_{k} ∣ s)} \\ = log \frac{γ_{k} \sqrt{ν_{k s}}}{π ϕ_{k s} \sqrt{ψ_{k s}}} - γ_{k} {∣ Y_{k} ∣}^{2} + ϕ_{k s} {∣ {\bar{X}}_{k s} ∣}^{2} - {\bar{ξ}}_{k s} - \frac{ν_{k s}}{2} [{({\bar{ξ}}_{k s} - μ_{k s})}^{2} + \frac{1}{ψ_{k s}}] + \frac{1}{2} . \end{array}

(33)

The posterior mixture probability is

q (s) = \frac{exp (\sum_{k} g_{k s}) p (s)}{Z}

(34)

Z = \sum_{s} exp (\sum_{k} g_{k s}) p (s) .

(35)

The function log(Z) = log p(Y₁, …, Y_K) − D(q||p) increases when D(q||p) decreases. Because we use a Gaussian for q(ξ_k|s), log(Z) is not theoretically guaranteed to increase, but it is used empirically to monitor the convergence.
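One way to sanity-check (33) is to compare g_ks with a numerical evaluation of log p(Y_k|s). With Gaussian q(X_k|s) and q(ξ_k|s) tied together through (25) and (26), the expectation in the first line of (33) is a lower bound on log p(Y_k|s) by Jensen's inequality, so g_ks should never exceed it. The sketch below is illustrative only: the function names, the trapezoidal integration grid, and the test values are assumptions.

```python
import math

def g_ks(Y, gamma, mu, nu, xi_bar, psi):
    """Closed form (33), with phi and X_bar obtained from (26) and (25)."""
    phi = gamma + math.exp(-xi_bar + 0.5 / psi)               # (26)
    X_bar = gamma * Y / phi                                   # (25)
    return (math.log(gamma * math.sqrt(nu) / (math.pi * phi * math.sqrt(psi)))
            - gamma * abs(Y) ** 2 + phi * abs(X_bar) ** 2 - xi_bar
            - 0.5 * nu * ((xi_bar - mu) ** 2 + 1.0 / psi) + 0.5)

def log_marginal(Y, gamma, mu, nu, n=4001, width=10.0):
    """log p(Y_k|s): integrate out X_k analytically and xi_k by the trapezoidal rule."""
    sigma = 1.0 / math.sqrt(nu)
    lo = mu - width * sigma
    step = 2 * width * sigma / (n - 1)
    total = 0.0
    for i in range(n):
        xi = lo + i * step
        var = 1.0 / gamma + math.exp(xi)                      # variance of Y_k given xi_k
        lik = math.exp(-abs(Y) ** 2 / var) / (math.pi * var)
        prior = math.sqrt(nu / (2 * math.pi)) * math.exp(-0.5 * nu * (xi - mu) ** 2)
        total += (0.5 if i in (0, n - 1) else 1.0) * lik * prior
    return math.log(total * step)
```

The gap between the two quantities is the KL-divergence for that bin and mixture, which is why maximizing Σ_k g_ks tightens the approximation.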

With the estimated log-spectra ξ̄_ks, FFT coefficients X̄_ks, and posterior mixture probability q(s), signals are constructed in two ways given by (18) and (20). Time domain signal is synthesized by applying IFFT.

V. Experiments

The performances of the algorithms were evaluated using the materials provided by the speech separation challenge [29].

A. Dataset Description

The data set contained six-word segments from 34 speakers. Each segment was 1–2 seconds long, sampled at 25 kHz. The acoustic signal followed the grammar 〈$command〉 〈$color〉 〈$preposition〉 〈$letter〉 〈$number〉 〈$adverb〉. There were 25 choices for the letter (A–Z except W), ten choices for the number, and four choices for each of the others. The training set contained segments of clean signals for each speaker, and the test set contained speech corrupted by noise. The spectra of speech and noise averaged over one segment are shown in Fig. 2. In the plot, the speech and noise have the same power, i.e., 0-dB SNR. Because the spectrum of the noise has a shape similar to that of speech, it is called speech-shaped noise (SSN). The test data consisted of noisy signals at four different SNRs: −12 dB, −6 dB, 0 dB, and 6 dB. There were 600 utterances for each SNR condition from all 34 speakers, who contributed roughly equally. The task is to recover the speech signals corrupted by SSN. The performance of the algorithms was compared by the word recognition error rate using the provided speech recognition engine [29].

Fig. 2 — Plot of spectra of noise (dotted line) and clean speech (solid line) averaged over one segment under 0-dB SNR. Note the similar spectral shape.

To evaluate our algorithm under different types of noise, we added white Gaussian noise to the clean signals at SNR levels of −12 dB, −6 dB, 0 dB, 6 dB, and 12 dB to generate noisy signals.

The signal is divided into frames of length 800 with half overlap, and a Hanning window of size 800 is applied to each frame. Then a 1024-point FFT is performed on the zero-padded frames to extract the frequency components. The log-spectral coefficients are obtained by taking the log magnitude of the FFT coefficients. Due to the symmetry of the FFT, only the first 513 components are kept.
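The framing described above can be sketched with the standard library alone. Several details are assumptions rather than statements from the paper: the symmetric form of the Hanning window, the use of log|X_k|² (matching the definition in Section II) for the log-spectrum, the small floor that guards against log(0), and the recursive radix-2 FFT, which stands in for whatever FFT routine was actually used.

```python
import cmath
import math

def fft(a):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def analyze(signal, frame_len=800, n_fft=1024):
    """Half-overlapping, Hann-windowed, zero-padded frames; keeps n_fft/2 + 1 bins."""
    hop = frame_len // 2
    # Symmetric Hann window (an assumption; the paper does not specify the variant).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        seg += [0.0] * (n_fft - frame_len)                    # zero-pad 800 -> 1024
        frames.append(fft(seg)[: n_fft // 2 + 1])             # 513 bins
    return frames

def log_spectrum(spec, floor=1e-12):
    """log|X_k|^2 per bin; the floor is a hypothetical guard against log(0)."""
    return [math.log(max(abs(c) ** 2, floor)) for c in spec]
```

For a real-valued input the remaining 511 bins are complex conjugates of those kept, which is the symmetry the text refers to.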

B. Training the Gaussian Scale Mixture Model

The GSMM with 30 mixtures was trained using 2 min of signal concatenated from the training set for each speaker. We applied the k-means algorithm to partition the log-spectra into k = 30 clusters, which were used to initialize the GMM; the GMM was then trained by the standard EM algorithm. Initialized by the GMM, the EM algorithm derived in Section III was run to train the GSMM. After training, the speech model was fixed and served as the signal prior. It was not updated when processing the noisy signals.

C. Benchmarks for Comparison

The benchmark algorithms included the Wiener filter, STSA [8], the perceptual model [17], the linear approximation [12], [13], and the super-Gaussian model [10]. The spectrum of noise was assumed to be known and estimated from the noise.

1) Wiener Filter

The time-varying Wiener filter makes use of the power of the signal and noise, and assumes they are stationary for a short period of time. In the experiment, we first divided the signals into frames of 800 samples with half overlap. The power of speech and noise was assumed constant within each frame. To estimate them, we further divided each frame into sub-frames of 200 samples with half overlap. The sub-frames were zero-padded to 256 points, Hanning windows were applied, and a 256-point FFT was performed. The average power of the FFT coefficients over all sub-frames belonging to frame t gave the estimate of the signal power, denoted by $P_{t k}^{x}$ . The same method was used to compute the noise power, denoted by $P_{t k}^{n}$ . The signal was estimated as $X_{tjk} = (P_{t k}^{x} / (P_{t k}^{x} + P_{t k}^{n})) Y_{tjk}$ , where j is the sub-frame index and k denotes the frequency bin. After applying the IFFT to X_tjk, each frame was synthesized by overlap-adding the sub-frames, and the estimated speech signal was obtained by overlap-adding the frames.
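The gain computation for one frame can be illustrated as follows. The sketch takes the sub-frame FFT coefficients as given (lists of complex numbers, one list per sub-frame) and omits the windowing, IFFT, and overlap-add steps; the function names and the data layout are assumptions.

```python
def average_power(subframe_spectra):
    """Mean |FFT|^2 over the sub-frames of one frame, per frequency bin."""
    n_sub = len(subframe_spectra)
    n_bins = len(subframe_spectra[0])
    return [sum(abs(sf[k]) ** 2 for sf in subframe_spectra) / n_sub
            for k in range(n_bins)]

def wiener_filter(noisy_subframes, P_x, P_n):
    """Apply the gain P_x / (P_x + P_n) to every sub-frame of the noisy frame."""
    # Assumes P_x[k] + P_n[k] > 0 for every bin; a silent bin would need special handling.
    return [[P_x[k] / (P_x[k] + P_n[k]) * sf[k] for k in range(len(sf))]
            for sf in noisy_subframes]
```

Because P_x and P_n are computed from the clean speech and the noise themselves, this filter uses information that a practical enhancer would not have.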

The performance of the Wiener filter can be regarded as an experimental upper bound. The signal and noise power was derived locally for each frame from the clean speech and noise. So the Wiener filter contained strong detailed speech priors.

2) STSA

The 1024-point FFT was performed on the zero-padded frames of length 800. The STSA models the FFT coefficients of the speech and of the noise each by a single Gaussian, whose variances are estimated from the clean signal and the noise. The amplitude estimator is given by [8, Eq. (7)].

3) Perceptual Model

Because we consider the SSN, it is interesting to test the performance of the perceptually motivated noise reduction technique. The spectral similarity may pose difficulty to such models. For this purpose, we included the method described in [17]. The algorithm estimated the spectral amplitude by minimizing the cost function

C ({\hat{a}}_{k}, a_{k}) = \begin{cases} {({\hat{a}}_{k} - a_{k} - \frac{m_{k}}{2})}^{2} - {(\frac{m_{k}}{2})}^{2}, & \text{if } ∣ {\hat{a}}_{k} - a_{k} - \frac{m_{k}}{2} ∣ > \frac{m_{k}}{2} \\ 0, & \text{otherwise} \end{cases}

(36)

where â_k is the estimated spectral amplitude and a_k is the true spectral amplitude. This cost function penalizes the positive and negative errors differently, because positive estimation errors are perceived as additive noise and negative errors are perceived as signal attenuation [17]. Because of the stochastic property of speech, â_k minimizes the expected cost function

{\hat{a}}_{k} = arg min_{{\hat{a}}_{k}} \int \int C ({\hat{a}}_{k}, a_{k}) p (α_{k}, a_{k} ∣ Y_{k}) {d α}_{k} {d a}_{k}

(37)

where α_k is the phase and p(α_k, a_k|Y_k) is the posterior signal distribution. Details of the algorithm can be found in [17]. The MATLAB code is available online [30]. The original code adds synthetic white noise to the clean signal; we modified it to add SSN to corrupt the speech at different SNR levels.
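The asymmetry of the cost function (36) is easy to see by evaluating it directly. The sketch below transcribes (36) for scalar arguments; the function name is hypothetical, and m_k is treated simply as a positive parameter whose definition is given in [17].

```python
def perceptual_cost(a_hat, a, m):
    """Cost (36) for an estimated amplitude a_hat, a true amplitude a, and parameter m."""
    d = a_hat - a - m / 2.0
    if abs(d) > m / 2.0:
        return d * d - (m / 2.0) ** 2
    return 0.0
```

With this form, an overestimate of up to m_k incurs no cost, while any underestimate is penalized: for a = 1 and m = 1, `perceptual_cost(1.5, 1.0, 1.0)` is zero whereas `perceptual_cost(0.5, 1.0, 1.0)` is positive.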

4) Linear Approximation

This approach was developed in [12], [13] and worked in the log-spectral domain. It assumed a GMM for the signal log-spectra and a Gaussian for the noise log-spectra. So the noise had a log-normal density, in contrast to Gaussian noise. The relationship among the log-spectra of the signal x, the noisy signal y and the noise n is given by

y_{k} = x_{k} + log (1 + exp (n_{k} - x_{k})) + ε_{k}

(38)

where ε_k is an error term.

However, this nonlinear relationship causes intractability. A linear approximation was used in [12], [13] by expanding (38) around z̃_ks = (x̃_ks, ñ_ks)^T linearly. This approximation provided efficient speech enhancement. The choice for z̃_ks can be iteratively optimized.
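To make the linearization concrete, the sketch below evaluates the noise-free part of (38) and its first-order Taylor expansion around a point (x̃, ñ). It is a generic first-order expansion written for illustration and is not claimed to match the exact form or the iteration scheme used in [12], [13]; the function names are assumptions.

```python
import math

def f(x, n):
    """Noise-free part of (38): x + log(1 + exp(n - x)), which equals log(e^x + e^n)."""
    return x + math.log1p(math.exp(n - x))

def linearize(x0, n0):
    """Value and partial derivatives of f at the expansion point (x0, n0)."""
    w = 1.0 / (1.0 + math.exp(x0 - n0))      # df/dn; df/dx = 1 - w
    return f(x0, n0), 1.0 - w, w

def f_linear(x, n, x0, n0):
    """First-order Taylor approximation of f around (x0, n0)."""
    y0, gx, gn = linearize(x0, n0)
    return y0 + gx * (x - x0) + gn * (n - n0)
```

The two partial derivatives sum to one, so the linearized noisy log-spectrum is a convex combination of the speech and noise log-spectra plus an offset; the weight shifts toward whichever of the two dominates at the expansion point.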

5) Super-Gaussian Prior

This method was developed in [10]. Let X_R = Re{X} and X_I = Im{X} denote the real and imaginary parts of the signal FFT coefficients. They were processed separately and symmetrically. We consider the real part and assume that X_R obeys a double-sided exponential distribution

p (X_{R}) = \frac{1}{σ_{x}} e^{- (2 ∣ X_{R} ∣ / σ_{x})} .

(39)

Assume the Gaussian noise N with density $p (N) = N (0, 1 / σ_{n}^{2})$ . Here, $σ_{x}^{2}$ and $σ_{n}^{2}$ are the means of |X|² and |N|², respectively. Let $ξ = σ_{x}^{2} / σ_{n}^{2}$ be the a priori SNR, Y_R = Re{Y} be the real part of the noisy signal FFT coefficient. Define $L_{R +} = 1 / \sqrt{ξ} + Y_{R} / σ_{n}$ , and $L_{R -} = 1 / \sqrt{ξ} - Y_{R} / σ_{n}$ . It was shown in [10, Eq. (11)] that the optimal estimator for the real part is

{\hat{X}}_{R} = Y_{R} + \frac{σ_{n}}{\sqrt{ξ}} \frac{e^{2 Y_{R} / σ_{x}} erfc (L_{R +}) - e^{- (2 Y_{R} / σ_{x})} erfc (L_{R -})}{e^{2 Y_{R} / σ_{x}} erfc (L_{R +}) + e^{- (2 Y_{R} / σ_{x})} erfc (L_{R -})}

(40)

where erfc(x) denotes the complementary error function. The optimal estimator for the imaginary part X̂_I was derived analogously. The FFT coefficient estimator was given by X̂ = X̂_R + iX̂_I.
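A direct transcription of (40) for the real part is sketched below, using `math.erfc` from the Python standard library. The function name is hypothetical, and the expression is evaluated naively, so for large |Y_R| relative to σ_x and σ_n the exponentials may overflow or the erfc terms underflow; a robust implementation would need scaled arithmetic.

```python
import math

def laplacian_prior_estimate(Y_R, sigma_x, sigma_n):
    """Estimator (40) for the real part, evaluated directly."""
    xi = sigma_x ** 2 / sigma_n ** 2                 # a priori SNR
    L_plus = 1.0 / math.sqrt(xi) + Y_R / sigma_n
    L_minus = 1.0 / math.sqrt(xi) - Y_R / sigma_n
    pos = math.exp(2.0 * Y_R / sigma_x) * math.erfc(L_plus)
    neg = math.exp(-2.0 * Y_R / sigma_x) * math.erfc(L_minus)
    return Y_R + sigma_n / math.sqrt(xi) * (pos - neg) / (pos + neg)
```

The estimator is odd in Y_R and shrinks the observation toward zero, which is the expected behavior under a prior peaked at the origin.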

D. Comparison Criteria

We employed two criteria to evaluate performance of all algorithms: SNR and word recognition error rate. In all experiments, the estimated time domain signals x̂[t] were normalized such that they have the same power as the clean signals.

1) Signal-to-Noise Ratio (SNR)

SNR is defined in the time domain as

SNR = 10 {log}_{10} \frac{\sum_{t} {∣ x [t] ∣}^{2}}{\sum_{t} {∣ \hat{x} [t] - x [t] ∣}^{2}}

(41)

where x[t] is the clean signal and x̂[t] is the estimated signal.
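The power normalization described in Section V-D and the SNR in (41) can be computed as in the following sketch. The function names are assumptions, and returning infinity when the error energy is exactly zero is a convention chosen here, not one stated in the paper.

```python
import math

def match_power(x_hat, x):
    """Rescale x_hat so that it has the same total power as the clean signal x."""
    p_x = sum(v * v for v in x)
    p_hat = sum(v * v for v in x_hat)
    g = math.sqrt(p_x / p_hat)
    return [g * v for v in x_hat]

def snr_db(x, x_hat):
    """Output SNR (41) in dB for real-valued time domain signals of equal length."""
    sig = sum(v * v for v in x)
    err = sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return float("inf") if err == 0 else 10.0 * math.log10(sig / err)
```

For example, an estimate that is uniformly 10% too small has an error energy of 1% of the signal energy, giving 20 dB before normalization; after `match_power` the same estimate coincides with the clean signal up to rounding.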

2) Word Recognition Error Rate

The speech recognition engine based on the HTK package was provided on the ICSLP website [29]. It extracts 39 features from the acoustic waveforms, including 12 Mel-frequency cepstral coefficients (MFCC) and the logarithmic frame energy, their velocities (Δ MFCC) and accelerations (ΔΔ MFCC). The HMM with no skipover states and two states for each phoneme was used to model each word. The emission probability for each state was a GMM of 32 mixtures, of which the covariance matrices are diagonal. The grammar used in the recognizer is the same as the one shown in Section V-A. More details about the recognition engine are provided at [29].

To compute the recognition error rate, a score of {0, 1, 2, 3} was assigned to each utterance depending on how many key words (color, letter, digit) were incorrectly recognized. The average word recognition error rate was the average of the scores of all 600 testing utterances divided by 3, i.e., the percentage of wrongly recognized key words. This was carried out for each SNR condition.

E. Results

1) Speech Shaped Noise

We applied the algorithms to enhance speech corrupted by SSN at four SNR levels and compared them by SNR and word recognition error rate. The Wiener filter was regarded as an experimental upper bound, because it incorporates a detailed signal prior from the clean speech.

The spectrograms of female speech and male speech are shown in Figs. 3 and 4, respectively. Fig. 5 shows the output SNR as a function of input SNR for all algorithms. The output SNR is averaged over the 600 test segments. Fig. 6 plots the word recognition error rate.

Fig. 3 — Spectrogram of a female speech “lay blue with e four again.” (a) Clean speech; (b) noisy speech of 6-dB SNR; (c–k) enhanced signals by (c) Wiener filter, (d) STSA, (e) perceptual model (Wolfe), (f) linear approximation (Linear), (g) super-Gaussian prior (SuperGauss), (h) FFT coefficients estimation by GSMM using Laplace method, see (18), (i) log-spectra estimation by GSMM using Laplace method, see (20), (j) FFT coefficients estimation by GSMM using variational approximation, (k) log-spectra estimation by GSMM using variational approximation.

Fig. 4 — Spectrogram of a male speech “lay green at r nine soon.” (a) Clean speech; (b) noisy speech of 6-dB SNR; (c–k) enhanced signals by various algorithms. See Fig. 3.

Fig. 5 — Output SNRs as a function of the input SNR for nine models (inset) for the case that the speeches are corrupted by SSN. See Fig. 3 for description of algorithms.

Fig. 6 — Word recognition error rate as a function of the input SNR for nine models (inset) for the case that the speeches are corrupted by SSN. See Fig. 3 for description of algorithms.

The Wiener filter outperformed other methods in low SNR conditions. This is because the power of noise and speech was calculated locally, and it incorporated detailed prior information. The perceptual model and STSA failed to suppress the SSN because of the spectral similarity between the speech and the noise. The linear approximation gave very low word recognition error rate, but not superior SNR. The reason is that, using a GMM in the log-spectral domain as speech model, it reliably estimated the log-spectrum which is a good fit to the recognizer input (MFCC). Because the super-Gaussian prior model treated the real and imaginary parts of the FFT coefficients separately, it provided less accurate spectral amplitude estimation and was inferior to the linear approximation. Both the Laplace method and variational approximation, based on GSMM for the speech signal, gave superior SNR for signals constructed from the estimated FFT coefficients and lower word recognition error rate for signals constructed from the estimated log-spectra. This agreed with the expectation that frequency domain approach gave higher SNR, while log-spectral domain method was more suitable for speech recognition. In comparing the two methods, the variational approximation performed better than the Laplace method in the high SNR range. It is hard to compare them in the low SNR range, because speech enhancement was minimal.

Perceptually, the Wiener filter gave smooth and natural speech. The signals enhanced by STSA, the perceptual model, and the super-Gaussian prior model contained obvious noise, because such techniques are based on spectral analysis and failed to remove the SSN. The linear approximation removed the noise, but the output signals were discontinuous. For the algorithms based on Gaussian scale mixture models, the signals constructed from the estimated FFT coefficients were smoother than those constructed from the log-spectra. The reason is that the perceptual quality of the signals was sensitive to the log-spectra, because the amplitudes were obtained by taking the exponential of the log-spectra. A discontinuity in the log-spectra was therefore more noticeable than one in the FFT coefficients. Because the phase of the noisy signals was used to synthesize the estimated signals, the enhanced signals contained reverberation. Among all the algorithms, we found that the GSMM with the Laplace method gave the most satisfactory results: the noise was removed and the signals were smooth. The examples are available at http://chord.ucsd.edu/~jiucang/gsmm.
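The synthesis step described above can be sketched for a single frequency bin. This is a minimal illustration, not the authors' code: it assumes the log-spectrum is the log of the power (so the amplitude is the exponential of half the log-spectrum, in line with the model's covariance being the exponential of the log-spectrum), and the function name `synthesize_bin` is hypothetical.

```python
import cmath
import math

def synthesize_bin(log_spectrum, noisy_coeff):
    """Combine an estimated log-spectrum with the phase of the noisy coefficient.

    log_spectrum: estimated log power for one bin (assumption: log of |X|^2).
    noisy_coeff:  complex FFT coefficient of the noisy signal for the same bin.
    """
    amplitude = math.exp(0.5 * log_spectrum)   # exponential maps log power to amplitude
    phase = cmath.phase(noisy_coeff)           # phase is borrowed from the noisy signal
    return cmath.rect(amplitude, phase)

# Usage: a log power of log(4) gives amplitude 2 with the noisy phase of 45 degrees.
estimate = synthesize_bin(math.log(4.0), 3 + 3j)
```

Because the amplitude depends exponentially on the log-spectrum, a small jump in the estimated log-spectrum between frames becomes a multiplicative jump in amplitude, which is consistent with the discontinuities reported above.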

2) White Gaussian Noise

We also applied the algorithms to enhance speech corrupted by white Gaussian noise. For this experiment, we tested them at five SNR levels: −12 dB, −6 dB, 0 dB, 6 dB, and 12 dB. The algorithms were the same as in the previous section. Fig. 7 shows the output SNRs and Fig. 8 plots the word recognition error rate.

Fig. 7 — Output SNR as a function of the input SNR for nine models (inset) for speech corrupted by white Gaussian noise. See Fig. 3 for a description of the algorithms.

Fig. 8 — Word recognition error rate as a function of the input SNR for nine models (inset) for speech corrupted by white Gaussian noise. See Fig. 3 for a description of the algorithms.

All the algorithms were able to improve the SNR. The signals constructed from the FFT coefficients estimated by the GSMM with the Laplace method gave the best output SNR for all input SNRs. The spectral analysis models, such as STSA and the perceptual model, were also able to improve the SNR, because of the spectral difference between the signal and the noise. The algorithms that estimated the log-spectra (Linear, GSMM Lap LS, and GSMM Var LS) gave lower word recognition error rates, because the log-spectra estimate was a good fit to the recognizer. For the GSMM, the FFT coefficient estimation offered better SNR and the log-spectra estimation offered a lower recognition error rate, as expected.

Although STSA, the perceptual model, and the super-Gaussian prior all increased the SNR, the residual noise was perceptually noticeable. Signals constructed from the estimated log-spectra sounded less continuous than signals constructed from the estimated FFT coefficients. However, the signals sounded synthesized, because the phase of the noisy signal was used. The examples are available at http://chord.ucsd.edu/~jiucang/gsmm.

VI. Conclusion

We have presented a novel Gaussian scale mixture model for speech signals and derived two methods for speech enhancement: the Laplace method and a variational approximation. The GSMM treats the FFT coefficients and the log-spectra as two random variables, and models their relationship probabilistically. This enables us to estimate both the FFT coefficients, which produce better signal quality in the time domain, and the log-spectra, which are more suitable for speech recognition. The performance of the proposed algorithms was demonstrated by applying them to enhance speech corrupted by SSN and by white Gaussian noise. The FFT coefficient estimation gave higher SNR, while the log-spectra estimation produced a lower word recognition error rate.

Acknowledgments

The authors would like to thank H. Attias for suggesting the model and helpful advice on the inference algorithms. They would also like to thank the anonymous reviewers for valuable suggestions.

Biographies

Jiucang Hao received the B.S. degree from the University of Science and Technology of China (USTC), Hefei, and the M.S. degree from the University of California at San Diego (UCSD), both in physics. He is currently pursuing the Ph.D. degree at UCSD.

His research interests are to develop new machine learning algorithms and apply them to areas such as speech enhancement, source separation, biomedical data analysis, etc.

Te-Won Lee (M’03–SM’06) received the M.S. degree and the Ph.D. degree (summa cum laude) in electrical engineering from the University of Technology Berlin, Berlin, Germany, in 1995 and 1997, respectively.

He was Chief Executive Officer and co-Founder of SoftMax, Inc., a start-up company in San Diego developing software for mobile devices. In December 2007, SoftMax was acquired by Qualcomm, Inc., the world leader in wireless communications where he is now a Senior Director of Technology leading the development of advanced voice signal processing technologies. Prior to Qualcomm and SoftMax, he was a Research Professor at the Institute for Neural Computation, University of California, San Diego, and a Collaborating Professor in the Biosystems Department, Korea Advanced Institute of Science and Technology (KAIST). He was a Max-Planck Institute Fellow (1995–1997) and a Research Associate at the Salk Institute for Biological Studies (1997–1999).

Dr. Lee received the Erwin-Stephan Prize for excellent studies (1994) from the University of Technology Berlin, the Carl-Ramhauser prize (1998) for excellent dissertations from the DaimlerChrysler Corporation and the ICA Unsupervised Learning Pioneer Award (2007). In 2007, he received the SPIE Conference Pioneer Award for work on independent component analysis and unsupervised learning algorithms.

Terrence J. Sejnowski (SM’91–F’06) is the Francis Crick Professor at The Salk Institute for Biological Studies where he directs the Computational Neurobiology Laboratory, an Investigator with the Howard Hughes Medical Institute, and a Professor of Biology and Computer Science and Engineering at the University of California, San Diego, where he is Director of the Institute for Neural Computation. The long-range goal of his laboratory is to understand the computational resources of brains and to build linking principles from brain to behavior using computational models. This goal is being pursued with a combination of theoretical and experimental approaches at several levels of investigation ranging from the biophysical level to the systems level. His laboratory has developed new methods for analyzing the sources for electrical and magnetic signals recorded from the scalp and hemodynamic signals from functional brain imaging by blind separation using independent components analysis (ICA). He has published over 300 scientific papers and 12 books, including The Computational Brain (MIT Press, 1994) with Patricia Churchland.

Dr. Sejnowski received the Wright Prize for Interdisciplinary Research in 1996, the Hebb Prize from the International Neural Network Society in 1999, and the IEEE Neural Network Pioneer Award in 2002. He was elected an AAAS Fellow in 2006 and to the Institute of Medicine of the National Academies in 2008.

Appendix EM Algorithm for Training the GSMM

We present the details for the EM algorithm here. The parameters are estimated by maximizing the log-likelihood which is given by (10).

Expectation Step

When q(ξ_kt|s_t)q(s_t) equals the posterior distribution, the cost ℱ(q, θ) equals the log-likelihood ℒ(θ) and is maximized. The factor q(ξ_kt|s_t) is computed as

\begin{array}{l} \log q (ξ_{k t} ∣ s_{t}) = \log p (X_{k t} ∣ ξ_{k t}) + \log p (ξ_{k t} ∣ s_{t}) + c \\ = - ξ_{k t} - e^{- ξ_{k t}} {∣ X_{k t} ∣}^{2} - \frac{ν_{k s_{t}}}{2} {(ξ_{k t} - μ_{k s_{t}})}^{2} + c \end{array}

(42)

where c is a constant. Because this density has no closed form, we use the Laplace method [25] to approximate q by a Gaussian

q (ξ_{k t} ∣ s_{t}) = N (ξ_{k t} ∣ {\bar{ξ}}_{k t s_{t}}, φ_{k t s_{t}})

(43)

{\bar{ξ}}_{k t s_{t}} = {\hat{ξ}}_{k t s_{t}} + \frac{1}{φ_{k t s_{t}}} (e^{- {\hat{ξ}}_{k t s_{t}}} {∣ X_{k t} ∣}^{2} - ν_{k s_{t}} {\hat{ξ}}_{k t s_{t}} + ν_{k s_{t}} μ_{k s_{t}} - 1)

(44)

φ_{k t s_{t}} = e^{- {\hat{ξ}}_{k t s_{t}}} {∣ X_{k t} ∣}^{2} + ν_{k s_{t}} .

(45)

where ξ̂_{kts_t} is chosen to be the mode of the posterior and is iteratively updated by

{\bar{ξ}}_{k t s_{t}} \leftarrow {\bar{ξ}}_{k t s_{t}} + \frac{1}{φ_{k t s_{t}}} (e^{- {\bar{ξ}}_{k t s_{t}}} {∣ X_{k t} ∣}^{2} - ν_{k s_{t}} {\bar{ξ}}_{k t s_{t}} + ν_{k s_{t}} μ_{k s_{t}} - 1) .

(46)

This update rule is equivalent to maximizing log q(ξ_kt|s_t) using Newton’s method

{\bar{ξ}}_{k t s_{t}} \leftarrow {\bar{ξ}}_{k t s_{t}} - \frac{{[log q (ξ_{k t} ∣ s_{t})]}_{ξ_{k t} = {\bar{ξ}}_{k t s_{t}}}^{'}}{{[log q (ξ_{k t} ∣ s_{t})]}_{ξ_{k t} = {\bar{ξ}}_{k t s_{t}}}^{″}} .

(47)
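The Newton iteration in (46) can be sketched for a single frequency bin, frame, and mixture state. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `laplace_mode`, the starting point at the prior mean, and the stopping tolerance are all illustrative choices.

```python
import math

def laplace_mode(abs_x2, mu, nu, n_iter=50, tol=1e-12):
    """Newton iterations of (46) for one bin k, frame t, and state s.

    abs_x2: |X_kt|^2; mu, nu: mean and precision of the GMM component.
    Returns (xi, phi): the maximizer of log q in (42) and the precision in (45).
    """
    xi = mu  # assumption: start at the prior mean
    for _ in range(n_iter):
        scaled = math.exp(-xi) * abs_x2
        grad = scaled - nu * (xi - mu) - 1.0   # first derivative of log q in (42)
        phi = scaled + nu                      # minus the second derivative, as in (45)
        step = grad / phi                      # Newton step, as in (46) and (47)
        xi += step
        if abs(step) < tol:
            break
    phi = math.exp(-xi) * abs_x2 + nu          # precision evaluated at the final point
    return xi, phi

# Usage: |X|^2 = 4 with a standard-normal prior on the log-spectrum.
xi_hat, phi_hat = laplace_mode(4.0, 0.0, 1.0)
```

Since the second derivative of log q is always negative, the objective is strictly concave and has a single maximizer, which is why a plain Newton iteration is a reasonable choice here.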

Taking the derivative of ℱ(q, θ) with respect to q(s_t) and setting it to zero yields the optimal q(s_t). Define

\begin{array}{l} f_{k t s_{t}} = \int q (ξ_{k t} ∣ s_{t}) (\log p (X_{k t}, ξ_{k t} ∣ s_{t}) - \log q (ξ_{k t} ∣ s_{t})) \, d ξ_{k t} \\ = \log \frac{\sqrt{ν_{k s_{t}}}}{π \sqrt{φ_{k t s_{t}}}} - e^{- {\bar{ξ}}_{k t s_{t}} + 1 / (2 φ_{k t s_{t}})} {∣ X_{k t} ∣}^{2} - {\bar{ξ}}_{k t s_{t}} - \frac{ν_{k s_{t}}}{2} (\frac{1}{φ_{k t s_{t}}} + {({\bar{ξ}}_{k t s_{t}} - μ_{k s_{t}})}^{2}) + \frac{1}{2} . \end{array}

(48)
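The closed form in (48) follows from a few standard Gaussian expectations. The sketch below assumes that φ in (43) is a precision (inverse variance), as its use in (52) suggests, and that the likelihood is the complex Gaussian p(X|ξ) = (π e^{ξ})^{-1} exp(−e^{−ξ}|X|²), a form inferred here from (42) rather than quoted.

```latex
\begin{aligned}
E_q[ξ_{kt}] &= \bar{ξ}_{kts_t}, \\
E_q[e^{-ξ_{kt}}] &= e^{-\bar{ξ}_{kts_t} + 1/(2φ_{kts_t})}, \\
E_q[(ξ_{kt} - μ_{ks_t})^2] &= \frac{1}{φ_{kts_t}} + (\bar{ξ}_{kts_t} - μ_{ks_t})^2, \\
-E_q[\log q(ξ_{kt} ∣ s_t)] &= \frac{1}{2}\log\frac{2π}{φ_{kts_t}} + \frac{1}{2}, \\
-\log π + \frac{1}{2}\log\frac{ν_{ks_t}}{2π} + \frac{1}{2}\log\frac{2π}{φ_{kts_t}} &= \log\frac{\sqrt{ν_{ks_t}}}{π\sqrt{φ_{kts_t}}}.
\end{aligned}
```

The last line collects the normalization constants of the likelihood, the Gaussian prior, and the entropy of q into the leading logarithm of (48).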

Then q(s_t) can be obtained as

q (s_{t}) = \frac{exp (\sum_{k} f_{k t s_{t}}) p (s_{t})}{Z_{t}}

(49)

Z_{t} = \sum_{s_{t}} exp (\sum_{k} f_{k t s_{t}}) p (s_{t}) .

(50)
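Because the sums over k in (49) and (50) can be large and negative, exponentiating them directly may underflow. A minimal sketch of a numerically safer evaluation for one frame is given below; it uses the standard log-sum-exp shift, which is a swapped-in numerical device and not something the paper describes. The names `state_posterior`, `f`, and `log_prior` are illustrative.

```python
import math

def state_posterior(f, log_prior):
    """Evaluate (49) and (50) for one frame t.

    f[s][k]      : f_{kts} from (48) for state s and frequency bin k
    log_prior[s] : log p(s)
    Returns (q, log_Z): the posterior over states and log Z_t.
    """
    scores = [sum(f_s) + lp for f_s, lp in zip(f, log_prior)]
    shift = max(scores)                       # subtract the maximum before exponentiating
    log_Z = shift + math.log(sum(math.exp(v - shift) for v in scores))
    q = [math.exp(v - log_Z) for v in scores]
    return q, log_Z

# Usage: two states, two bins, equal priors; the scores differ by exactly 1.
q, log_Z = state_posterior([[-1000.0, -1000.0], [-1001.0, -1000.0]],
                           [math.log(0.5), math.log(0.5)])
```

Returning log Z_t directly is convenient because the per-frame values can be summed over t to track the cost during training.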

Maximization Step

The M-step optimizes ℱ(q, θ) over the model parameters θ

μ_{k s} = \frac{\sum_{t} q (s_{t} = s) {\bar{ξ}}_{k t s_{t}}}{\sum_{t} q (s_{t} = s)}

(51)

\frac{1}{ν_{k s}} = \frac{\sum_{t} q (s_{t} = s) [{({\bar{ξ}}_{k t s_{t}} - μ_{k s})}^{2} + \frac{1}{φ_{k t s_{t}}}]}{\sum_{t} q (s_{t} = s)}

(52)

p (s) = \frac{\sum_{t} q (s_{t} = s)}{\sum_{t s} q (s_{t} = s)} .

(53)
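The updates (51) to (53) are weighted averages over frames, with the posterior means and precisions from the E-step as inputs. The following is a minimal sketch for a single frequency bin using plain lists; the function name `m_step` and the data layout are assumptions made for illustration.

```python
def m_step(q, xi_bar, phi):
    """M-step updates (51)-(53) for one frequency bin k.

    q[t][s]      : state posterior q(s_t = s)
    xi_bar[t][s] : Gaussian mean of q(xi_kt | s_t = s)
    phi[t][s]    : Gaussian precision of q(xi_kt | s_t = s)
    Returns (mu, nu, prior), each a list indexed by state s.
    """
    n_frames, n_states = len(q), len(q[0])
    total = sum(sum(row) for row in q)
    mu, nu, prior = [], [], []
    for s in range(n_states):
        weight = sum(q[t][s] for t in range(n_frames))
        mean = sum(q[t][s] * xi_bar[t][s] for t in range(n_frames)) / weight   # (51)
        var = sum(q[t][s] * ((xi_bar[t][s] - mean) ** 2 + 1.0 / phi[t][s])
                  for t in range(n_frames)) / weight                           # (52)
        mu.append(mean)
        nu.append(1.0 / var)            # nu is a precision, so invert the variance
        prior.append(weight / total)    # (53)
    return mu, nu, prior

# Usage: three frames, two states, hard assignments for readability.
mu, nu, prior = m_step(q=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
                       xi_bar=[[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]],
                       phi=[[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]])
```

The sketch does not guard against a state that receives zero total posterior weight, which would cause a division by zero; a practical implementation would need to handle that case.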

The cost ℱ is computed as ℱ = Σ_t log(Z_t), which can be used empirically to monitor convergence, because ℱ is not guaranteed to increase due to the approximation in the E-step.

The parameters of a GMM trained in the log-spectral domain are used to initialize the EM algorithm. The E-step and M-step are iterated until convergence, which is very quick because ξ_k simulates the log-spectra.

Contributor Information

Jiucang Hao, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037 USA, and also with the Institute for Neural Computation, University of California, San Diego, CA 92093 USA.

Te-Won Lee, Qualcomm, Inc., San Diego, CA 92121 USA.

Terrence J. Sejnowski, Howard Hughes Medical Institute and Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037 USA, and also with the Division of Biological Sciences, University of California, San Diego, CA 92093 USA.

References

1. Ephraim Y, Cohen I. The Electrical Engineering Handbook. Boca Raton, FL: CRC; 2006. Recent advancements in speech enhancement.
2. Attias H, Platt JC, Acero A, Deng L. Speech denoising and dereverberation using probabilistic models. Proc NIPS. 2000:758–764.
3. Gannot S, Burshtein D, Weinstein E. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans Signal Process. 2001 Aug;49(8):1614–1626.
4. Cohen I, Gannot S, Berdugo B. An integrated real-time beamforming and postfiltering system for nonstationary noise environments. EURASIP J Appl Signal Process. 2003;11:1064–1073.
5. Boll SF. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust, Speech, Signal Process. 1979 Apr;ASSP-27(2):113–120.
6. Ephraim Y, Trees HLV. A signal subspace approach for speech enhancement. IEEE Trans Speech Audio Process. 1995 Jul;3(4):251–266.
7. Ephraim Y. Statistical-model-based speech enhancement systems. Proc IEEE. 1992 Oct;80(10):1526–1555.
8. Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust, Speech, Signal Process. 1984;ASSP-32(6):1109–1121.
9. Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust, Speech, Signal Process. 1985 Apr;33(2):443–445.
10. Martin R. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Trans Speech Audio Process. 2005 Sep;13(5):845–856.
11. Burshtein D, Gannot S. Speech enhancement using a mixture-maximum model. IEEE Trans Speech Audio Process. 2002 Sep;10(6):341–351.
12. Frey B, Kristjansson T, Deng L, Acero A. Learning dynamic noise models from noisy speech for robust speech recognition. Proc NIPS. 2001:1165–1171.
13. Kristjansson T, Hershey J. High resolution signal reconstruction. Proc IEEE Workshop ASRU. 2003:291–296.
14. Hopgood JR, Rayner PJ. Single channel nonstationary stochastic signal separation using linear time-varying filters. IEEE Trans Signal Process. 2003 Jul;51(7):1739–1752.
15. Czyzewski A, Krolikowski R. Noise reduction in audio signals based on the perceptual coding approach. Proc IEEE WASPAA. 1999:147–150.
16. Lee J-H, Jung H-J, Lee T-W, Lee S-Y. Speech coding and noise reduction using ICA-based speech features. Proc Workshop ICA. 2000:417–422.
17. Wolfe P, Godsill S. Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement. Proc ICASSP. 2000;2:821–824.
18. Ephraim Y. A Bayesian estimation approach for speech enhancement using hidden Markov models. IEEE Trans Signal Process. 1992 Apr;40(4):725–735.
19. Ephraim Y. Gain-adapted hidden Markov models for recognition of clean and noisy speech. IEEE Trans Signal Process. 1992 Jun;40(6):1303–1316.
20. Bishop CM. Neural Networks for Pattern Recognition. New York: Oxford Univ. Press; 1995.
21. Andrews D, Mallows C. Scale mixture of normal distributions. J R Statist Soc. 1974;36(1):99–102.
22. Wolfe P, Godsill S, Ng W. Bayesian variable selection and regularization for time-frequency surface estimation. J R Statist Soc. 2004;66(3):575–589.
23. Fevotte C, Godsill S. A Bayesian approach for blind separation of sparse sources. IEEE Trans Audio, Speech, Lang Process. 2006 Dec;14(6):2174–2188.
24. Vincent E, Plumbley M. Low bit-rate object coding of musical audio using Bayesian harmonic models. IEEE Trans Audio, Speech, Lang Process. 2007 May;15(4):1273–1282.
25. Azevedo-Filho A, Shachter RD. Laplace’s method approximations for probabilistic inference in belief networks with continuous variables. Proc UAI. 1994:28–36.
26. Attias H. A variational Bayesian framework for graphical models. Proc NIPS. 2000;12:209–215.
27. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. J R Statist Soc. 1977;39(1):1–38.
28. Cover TM, Thomas JA. Elements of Information Theory. New York: Wiley-Interscience; 1991.
29. Cooke M, Lee T-W. Speech Separation Challenge. [Online]. Available: http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.html.
30. Wolfe P. Example of Short-Time Spectral Attenuation. [Online]. Available: http://www.eecs.harvard.edu/~patrick/research/stsa.html.
31. Cohen I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Trans Speech Audio Process. 2003 Sep;11(5):466–475.
32. Cohen I, Berdugo B. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process Lett. 2002 Jan;9(1):12–15.
33. McAulay R, Malpass M. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans Acoust, Speech, Signal Process. 1980 Apr;ASSP-28(2):137–145.
34. Martin R. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans Speech Audio Process. 2001 Jul;9(5):504–512.
35. Wang D, Lim J. The unimportance of phase in speech enhancement. IEEE Trans Acoust, Speech, Signal Process. 1982 Aug;ASSP-30(4):679–681.
36. Attias H, Deng L, Acero A, Platt J. A new method for speech denoising and robust speech recognition using probabilistic models for clean speech and for noise. Proc Eurospeech. 2001:1903–1906.
37. Brandstein MS. On the use of explicit speech modeling in microphone array applications. Proc ICASSP. 1998:3613–3616.
38. Hong L, Rosca J, Balan R. Independent component analysis based single channel speech enhancement. Proc ISSPIT. 2003:522–525.
39. Beaugeant C, Scalart P. Speech enhancement using a minimum least-squares amplitude estimator. Proc IWAENC. 2001:191–194.
40. Lotter T, Vary P. Noise reduction by maximum a posterior spectral amplitude estimation with supergaussian speech modeling. Proc IWAENC. 2003:83–86.
41. Breithaupt C, Martin R. MMSE estimation of magnitude-squared DFT coefficients with supergaussian priors. Proc ICASSP. 2003:848–851.
42. Benesty J, Chen J, Huang Y, Doclo S. Study of the Wiener filter for noise reduction. In: Benesty J, Makino S, Chen J, editors. Speech Enhancement. New York: Springer; 2005. pp. 9–42.
